AD-A080 *07  AIR FORCE INST OF TECH  WRIGHT-PATTERSON AFB OH  SCHOO--ETC  F/G 12/1

CROSS VALIDATION OF SELECTION OF VARIABLES IN MULTIPLE REGRESSI--ETC(U)
DEC 79  J R CAFARELLA

UNCLASSIFIED  AFIT/GOR/MA/79D-2


UNITED STATES AIR FORCE
AIR UNIVERSITY

AIR FORCE INSTITUTE OF TECHNOLOGY

Wright-Patterson Air Force Base, Ohio


DISCLAIMER  NOTICE 




THIS  DOCUMENT  IS  BEST  QUALITY 
PRACTICABLE.  THE  COPY  FURNISHED 
TO  DDC  CONTAINED  A  SIGNIFICANT 
NUMBER  OF  PAGES  WHICH  DO  NOT 
REPRODUCE  LEGIBLY. 



CROSS  VALIDATION  OF 
SELECTION  OF  VARIABLES 
IN  MULTIPLE  REGRESSION 


AFIT/GOR/MA/79D-2


THESIS 

Joseph  R.  Cafarella,  Jr. 
2  Lt  USAF 


Approved  for  public  release;  distribution  unlimited. 


AFIT/GOR/MA/79D-2


CROSS VALIDATION OF SELECTION OF
VARIABLES IN MULTIPLE REGRESSION


THESIS



Presented  to  the  Faculty  of  the  School  of  Engineering 
of  the  Air  Force  Institute  of  Technology 
Air  University 

In  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of 
Master  of  Science 


Approved  for  public  release;  distribution  unlimited 


Table of Contents


Acknowledgements

Abstract

List of Figures

List of Tables

I  Introduction

Background

Focus of This Research

II  Concept Overview

Theory of Least Squares Regression

Assumptions

Method of Least Squares

Measures of Merit

III  Review of Past Research

IV  Model Development and Selection of Variables

The Westinghouse Data Base

Previous Models

Automatic Interaction Detection

AID Algorithm and Objective

Preparation for the Use of AID

Results

V  Cross Validation, Conclusions and Recommendations

Cross Validation

Conclusions

Recommendations

Bibliography

Appendix A:  LRU Description

Appendix B:  Part 1  Listing of Phase I Data

             Part 2  Listing of Phase II Data

Appendix C:  Itemized Input for AID

Appendix D:  Selected AID Output

Appendix E:  Selected SPSS Output

Appendix F:  Selected Leaps and Bounds Output

Vita


Acknowledgements 

In any research effort, success or failure depends on the efforts
of many individuals. With this in mind, I would like to express my
sincere gratitude to Dr. David R. Barr for his continued support,
expertise, and guidance throughout this research.

I would like to thank Lt. Colonel Charles W. McNichols, who
assisted me with various computer problems I encountered, and
Lt. Colonel Saul Young, who acted as the reader on my thesis committee.

I  would  also  like  to  thank  Diane  Summers  and  Dan  Ferins  of  the 
Systems  Evaluation  Branch  (AAA-3)  of  the  AF  Avionics  Laboratory  for 
supplying  the  necessary  data  and  background  information  used  in  this 
research. 




Abstract 


Techniques and criteria for selecting the "best" subset of
variables to be used in a regression model are reviewed.

A model was developed using the Automatic Interaction Detection (AID)
algorithm as a pre-screening device for locating those variables most
important to the regression, including interaction terms.

Five models, including the one developed with AID and one
developed by Westinghouse on avionic characteristic data, are used in
cross validation experiments to determine the predictive power of these
models on a new set of data points using the same set of variables.

A cross validation $R^2$ value is discussed as a criterion for choosing
between competing models.


I  Introduction 


Background 

Long term DoD planning goals require that operational and support
costs on all projects be reduced. Managers of these projects are
challenged by the need for accurate evaluation of these projects in
the early design stages. A question arises, however, concerning whether
model development and enhancement should be contracted out-of-house or
done in-house using available Air Force personnel. Performing
a cost analysis in-house would surely reduce costs. It would also
benefit the user of the model by providing first-hand knowledge of how
updates and changes in the data base affect the final results, and it
may uncover intermediate results unknown to a contractor.

One prerequisite for the user to perform in-house analysis is the
availability of the necessary computer packages. Another is the
user's knowledge of effective methods for analyzing the goodness of
fit of the models beyond the $R^2$ value or F-statistic discussed in
the next chapter. Once the user of the model
attains these prerequisites, in-house analysis can be performed.

Since these prerequisites for an in-house capability of cost
estimation were not available at the time, the Systems Evaluation
Branch (AAA-3) of the Air Force Avionics Laboratory at Wright-Patterson
Air Force Base requested that the Westinghouse Electric Corporation
perform a regression analysis on certain characteristics of Line
Replaceable Avionic Units (LRUs).

The  Westinghouse  approach  was  to  select  "candidate"  LRUs  for  inclusion 
in  the  data  base,  collect  data  on  design  and  logistic  characteristics 
on  the  LRUs,  perform  a  regression  analysis  on  the  data,  then  use  the 
resulting  cost  and  parametric  relationships  to  construct  a  model.  The 
resulting  model  was  named  the  Avionics  Laboratory  Predictive  Operations 
and  Support  (ALPOS)  model  [36]. 

One of the problems Westinghouse encountered, which most analysts
also encounter, involved the process used in the selection of the data.
Probably the most important element in the research is the nature of
the data which was used. Many difficulties can arise from
"bad" data and from wrong assumptions about the data, such as whether the data
subset collected is statistically different from the underlying
population or whether multicollinearity exists between variables.

In the initial phase, several LRUs were identified and considered
for inclusion in the data base from a wide variety of avionic units
placed on various types of aircraft. The LRU selection was naturally
constrained by the availability of the data and by the number of aircraft
on which the LRU was installed. This initial data base (Phase I)
consisted of sixty-three LRUs from seven different aircraft.

For their regression analysis, Westinghouse used the Linear
Least-Squares Curve Fitting Program (LLSCFP) developed by Daniel and
Wood [8]. This computer program uses over thirty statistics and five
types of plots to assist the analyst in developing meaningful variable
relationships.

In his Master's thesis, Captain Larry Pulcher attempted to provide
the means for the members of AAA-3 to conduct their own in-house
cost-estimation analysis by developing and testing criteria for the selection
of variables in a regression analysis. His methods included iterative techniques
using the Statistical Package for the Social Sciences (SPSS), all
possible regressions using the International Mathematical and Statistical
Library (IMSL) routine RLEAP, and prediction intervals computed with
the Omnitab computer package.

Both Westinghouse and Pulcher had available a set of potential
variables which could be considered for inclusion in the model; however,
both sets of variables were too large (more variables than data points).
Westinghouse used an approach in which "candidate" variables were screened
and tested before admission to the model. Pulcher used a screening
technique to eliminate certain candidate variables beforehand.

Focus  of  this  Research 

Westinghouse  has  recently  updated  the  data  collected  in  the  initial 
phase.  This  new  Phase  II  data  base  includes  sixty-five  additional  LRUs 
plus  six  previous  ones  placed  on  different  aircraft  for  a  total  of 
seventy-one  LRUs.  Also,  four  additional  aircraft  have  been  included. 

See  Table  I  for  a  summary  of  the  LRUs  investigated. 

One objective of this research is to review past research in the
area of selection of variables in a regression analysis in the hope
of stimulating thoughts and ideas of those analysts interested in
combining talents on this subject.

A second objective of this research is to examine the three previous
models developed by Pulcher and the Phase I model developed by
Westinghouse and determine which of the models best predicts the
Phase II data.

A  third  objective  of  this  research  is  to  use  the  Automatic 
Interaction  Detection  (AID)  algorithm  documented  by  Sonquist  and 
Morgan  [33,  34]  to  prescreen  variables  from  the  entire  data  set  and 
create  a  model  based  on  the  Phase  I  data  and  perform  the  same  predictive 
tests  mentioned  above  using  the  Phase  II  data.  A  Leaps  and  Bounds 
algorithm  was  used  to  assess  various  AID  models  to  determine  which  one 
should  be  represented  in  the  subsequent  analysis. 

Finally, updated coefficients were calculated for the best
predictive model determined in objectives two and three above.


II  Concept Overview


Theory  of  Least  Squares  Regression 

The fundamental purpose of a regression analysis is to build a
model useful in predicting a single dependent or criterion variable from
a set of independent or predictor variables. Many different
types of models can be created, such as the general linear model discussed
in the following section, and nonlinear, logarithmic, polynomial, reciprocal,
and multiplicative models. This research deals mainly with linear, polynomial,
and logarithmic models.

Assumptions 

Before  any  statistical  inferences  can  be  made  and  tests  performed 
on  the  significance  of  the  coefficient  estimates  and  the  independent 
variable,  certain  assumptions  must  be  made  about  the  data  and  about 
the  probability  distribution  of  the  random  error. 

The first assumption is that the data are a sample from the target
population. The second assumption is that the random variable $e$, the
error term, is:

(1)  statistically  independent 

(2)  identically  distributed 

(3)  from  a  population  with  zero  mean 

(4)  normally  distributed 

In other words, $e \sim N(0, \sigma^2)$, which means that $e$ is from a normal
probability density function with a mean of zero and a variance
of $\sigma^2$. Also, even though nothing is known about the probability distributions
describing the individual error terms, the Central Limit Theorem indicates that,
if independence can be assumed, their sum will tend to be normally
distributed. Finally, if we can assume that all the error terms have identical
probability distributions, then we ensure that each of them has the
same variance.

Method  of  Least  Squares 

The general form of the linear least squares model is

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_j X_j + \cdots + \beta_k X_k + e \tag{1}$$

where $Y$ is the observed value of the dependent variable
$X_j$ is the observed value of the $j$th independent variable
$\beta_0$ is the constant term
$\beta_j$ is the regression coefficient for the $j$th independent variable
$e$ is the random variable accounting for the error
$k$ is the number of independent variables

Note that $X_j$ can be the transformation of an original observation.

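For example, the Product of Powers model

$$Y = \beta_0 X_1^{\beta_1} X_2^{\beta_2} \tag{2}$$

can be transformed to a linear form by taking logarithms:

$$\ln(Y) = \ln(\beta_0) + \beta_1 \ln(X_1) + \beta_2 \ln(X_2) \tag{3}$$

or

$$Y^* = \beta_0^* + \beta_1 X_1^* + \beta_2 X_2^* \tag{4}$$

where the * indicates the transformed variable in equation (4), with $\beta_0^* = \ln(\beta_0)$.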
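
As an illustration of the transformation in equations (2) through (4), the following minimal sketch generates noise-free data from a product of powers relationship, takes logarithms, and recovers the coefficients by linear least squares. The coefficient values and predictor values are made up for illustration and are not taken from the thesis data; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical coefficients for Y = b0 * X1**b1 * X2**b2 (illustration only).
b0, b1, b2 = 2.0, 0.5, 1.5

# Made-up positive predictor values.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = b0 * x1**b1 * x2**b2          # noise-free, so the fit should be exact

# Transformed variables: Y* = ln Y, X1* = ln X1, X2* = ln X2.
y_star = np.log(y)
X_star = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])

# Linear least squares on the transformed data.
coef, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
b0_hat = np.exp(coef[0])          # the fitted intercept estimates ln(b0)
b1_hat, b2_hat = coef[1], coef[2]
print(b0_hat, b1_hat, b2_hat)
```

Note that the intercept of the transformed model estimates the logarithm of the multiplicative constant, so it must be exponentiated to return to the original scale.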

If there are $n$ observations, equation (1) can be written:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_j X_{ij} + \cdots + \beta_k X_{ik} + e_i \tag{5}$$

where $i = 1, 2, \ldots, n$.

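Since it is very difficult to discuss the multiple regression case
in algebraic terms, matrix notation will be used. Equation (5) can
be written as:

$$Y = X\beta + \varepsilon \tag{6}$$

where $Y$ represents an $n$-element column vector of observed values of the
dependent variable:

$$Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \tag{7}$$

$X$ represents an $n \times (k + 1)$ matrix. The first column contains
ones, corresponding to the constant term, and the remaining columns contain
the observed values of the $k$ independent variables:

$$X = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{bmatrix} \tag{8}$$

$\beta$ represents the $(k + 1)$-element column vector of coefficients:

$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} \tag{9}$$

$\varepsilon$ represents the $n$-element column vector of error terms:

$$\varepsilon = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \tag{10}$$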

The objective of the least squares technique is to fit a line
through a set of data points so that the sum of the squared differences
between $Y_i$ ($i = 1, 2, \ldots, n$), the actual values of the dependent variable,
and $\hat{Y}_i$, the estimated values of the dependent variable, is minimized.

$\hat{Y}_i$ is defined algebraically as:

$$\hat{Y}_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_j X_{ij} + \cdots + \beta_k X_{ik} \tag{11}$$

or in matrix notation as:

$$\hat{Y} = X\beta \tag{12}$$

The random error term $\varepsilon$ is the difference between $Y$ and $\hat{Y}$ and can
be written as follows:

$$\varepsilon = Y - \hat{Y} \tag{13}$$

A  two-dimensional  graphical  depiction  of  a  regression  line  using 
three  data  points  is  shown  in  Figure  1. 

The ideal situation is to have each of the error terms equal to
zero. That way, the regression model would fit the data points exactly.
In most cases, however, this is not possible, so making the error terms
collectively as small as possible is the best solution. Because positive
and negative errors would cancel in a simple sum, and in order to keep the
mathematics relatively easy, the error terms are made positive by squaring
each term before summation. This sum of squared errors (SSE) can be written as:

$$SSE = \sum_{i=1}^{n} e_i^2 = \varepsilon'\varepsilon \tag{14}$$

where $\varepsilon'$ is the transpose of $\varepsilon$. The objective can now be stated
as follows:

Find $\beta$ to minimize:

$$SSE = \varepsilon'\varepsilon = (Y - \hat{Y})'(Y - \hat{Y}) = (Y - X\beta)'(Y - X\beta) \tag{15}$$

Differentiating equation (15) with respect to $\beta$ and setting the result
equal to zero gives the normal equations $X'X\beta = X'Y$. Provided $X'X$ is
nonsingular, the estimator of $\beta$ which minimizes SSE is:

$$\hat{\beta} = (X'X)^{-1} X'Y \tag{16}$$
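
The least squares estimator can be sketched numerically as follows. This is a minimal illustration on made-up data (five observations, two independent variables), not the thesis data, and it solves the normal equations directly instead of forming the inverse explicitly, which is an implementation choice made here.

```python
import numpy as np

# Made-up data: n = 5 observations, k = 2 independent variables.
x = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1])

# Design matrix X: a leading column of ones for the constant term.
X = np.column_stack([np.ones(len(y)), x])

# Solve the normal equations (X'X) beta = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat              # estimated values of the dependent variable
resid = y - y_hat                 # error terms: observed minus estimated
sse = float(resid @ resid)        # sum of squared errors
print(beta_hat, sse)
```

At the minimum the residual vector is orthogonal to every column of X, which gives a convenient check on the computation.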


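[Figure 1. Regression line $\hat{Y} = \beta_0 + \beta_1 X$ through three data points; for the third point the figure marks the observed value $Y_3$, the estimated value $\hat{Y}_3$, and the error $\varepsilon_3$.]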


It is known, however, that a regression model containing these
estimates of $\beta$ will not explain all of the variability in the
dependent variable $Y$. Some of the variability in $Y$ will be explained
by the regression model and the remaining portion is left unexplained.
This idea can be stated as follows:

$$SST = SSR + SSE \tag{17}$$

where SST is the total sum-of-squares or the total variability in
the dependent variable and is defined as:

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 \tag{18}$$

SSR is the regression sum-of-squares or the amount of variability
explained by the regression model and is defined as:

$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$

or

$$SSR = \hat{\beta}' X'Y - n\bar{Y}^2 \tag{21}$$

SSE is the residual or error sum-of-squares or the remaining
amount of variability which is left unexplained and is defined by
equation (15).
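
The decomposition of the total variability can be checked numerically. The sketch below uses made-up data and a hypothetical helper, `sums_of_squares`, and assumes the model contains a constant term (the decomposition relies on it).

```python
import numpy as np

def sums_of_squares(X, y):
    """Return (SST, SSR, SSE) for a least squares fit of y on X.

    X is assumed to already contain a leading column of ones.
    """
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    n = len(y)
    y_bar = y.mean()
    sst = float(y @ y - n * y_bar**2)                  # total sum of squares
    ssr = float(beta_hat @ (X.T @ y) - n * y_bar**2)   # regression sum of squares
    resid = y - X @ beta_hat
    sse = float(resid @ resid)                         # error sum of squares
    return sst, ssr, sse

# Made-up data for illustration.
X = np.column_stack([np.ones(6), [1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5]])
y = np.array([2.0, 2.9, 6.1, 6.8, 10.2, 10.9])
sst, ssr, sse = sums_of_squares(X, y)
print(sst, ssr + sse)
```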

Measures of Merit

Since SST depends only on the values of the dependent variable,
$Y_i$, it is constant for any given set of $n$ observations. Also, since
SSE is being minimized, SSR is made as large as possible. It is
then reasonable to assume that the ratio of SSR to SST would be an
adequate indicator of the goodness of fit of the model to the data
and a good measure of merit of the regression. This ratio is denoted as
$R_y^2$, or simply as $R^2$, and is called the coefficient of determination or
the multiple R-squared value.

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, \qquad 0 \le R^2 \le 1 \tag{22}$$

According to Theil [35:178], the sample value of $R^2$ is somewhat
biased due to the degrees of freedom used in its calculation. Theil
suggests that a better measure of merit is $\bar{R}^2$, defined as the adjusted
multiple correlation coefficient.

$$\bar{R}^2 = 1 - (1 - R^2)\,[\,\cdots] \tag{23}$$

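or

$$\bar{R}^2 = 1 - (1 - R^2)\left(\frac{n-1}{n-k-1}\right) \tag{24}$$

if a constant term is included in the model, or equivalently as

$$\bar{R}^2 = R^2 - (1 - R^2)\left(\frac{k}{n-k-1}\right) \tag{25}$$

In either of the cases above, $\bar{R}^2$ is always less than or equal to $R^2$.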

It must be noted, however, that $\bar{R}^2$ is not an unbiased estimator. It
still has some merit, because when the number of coefficients being
estimated, $k$, becomes large compared to the number of observations or
data points, $n$, the unadjusted $R^2$ gives an optimistic picture of the amount of
variability in the dependent variable explained by the regression model.

$\bar{R}^2$ can also be defined as:

$$\bar{R}^2 = 1 - \frac{MSE}{MST} \tag{26}$$

where MSE, mean square error $= \dfrac{SSE}{n-k-1}$

and MST, mean square total $= \dfrac{SST}{n-1}$.

Thus, $MSE = MST \cdot (1 - \bar{R}^2)$, and minimizing MSE maximizes $\bar{R}^2$.
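
The equivalence of the three forms of the adjusted value for a model with a constant term can be verified with a short calculation. The sums of squares and sample sizes below are hypothetical numbers chosen for easy arithmetic, and the function names are illustrative.

```python
def r_squared(sse, sst):
    """Coefficient of determination, 1 - SSE/SST."""
    return 1.0 - sse / sst

def adjusted_r_squared(sse, sst, n, k):
    """Adjusted R^2 as 1 - MSE/MST for a model with a constant and k predictors."""
    mse = sse / (n - k - 1)   # mean square error
    mst = sst / (n - 1)       # mean square total
    return 1.0 - mse / mst

# Hypothetical values: 25 observations, 4 independent variables.
sse, sst, n, k = 20.0, 100.0, 25, 4

r2 = r_squared(sse, sst)
adj_from_ms = adjusted_r_squared(sse, sst, n, k)

# The same quantity written in terms of R^2 and degrees of freedom.
adj_from_ratio = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
adj_from_penalty = r2 - (1.0 - r2) * k / (n - k - 1)

print(r2, adj_from_ms, adj_from_ratio, adj_from_penalty)
```

The last form makes the penalty explicit: the adjustment subtracts a term that grows with the number of independent variables relative to the residual degrees of freedom.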


Mosier [26] has suggested a measure of merit similar to $R^2$ which
measures the predicting power of a model. Based on a model using the
original set of old data (Phase I), the estimated value for each data
point, $\hat{Y}_i$, was calculated. The cross validation SSE (c.v. SSE) was
then calculated by the following equation:

$$\text{c.v. } SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

where the $Y_i$'s are the actual (observed) values from the new set of
data (Phase II). Notice that the c.v. SSE is not the same as SSE
because $Y_i$ and $\hat{Y}_i$ did not come from the same sample.

The c.v. SSE is then used to calculate the cross validation $R^2$ by

$$\text{c.v. } R^2 = 1 - \frac{\text{c.v. } SSE}{SST}$$

Here, c.v. $R^2$ indicates the predictive
power of the old models on the new data.
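
A minimal sketch of the cross validation calculation follows. It assumes that SST is computed from the new observations about their own mean, which the text does not state explicitly; the function name and the numbers are hypothetical.

```python
def cross_validation_r2(y_new, y_pred):
    """c.v. R^2 = 1 - c.v. SSE / SST.

    y_new  : observed values from the new data set
    y_pred : values predicted for those points by a model fitted to old data
    SST is taken about the mean of y_new (an assumption of this sketch).
    """
    n = len(y_new)
    y_bar = sum(y_new) / n
    cv_sse = sum((y - p) ** 2 for y, p in zip(y_new, y_pred))
    sst = sum((y - y_bar) ** 2 for y in y_new)
    return 1.0 - cv_sse / sst

# Hypothetical new observations and predictions from an old model.
y_new = [10.0, 12.0, 14.0, 16.0, 18.0]
y_pred = [11.0, 12.0, 13.0, 17.0, 18.0]
cv_r2 = cross_validation_r2(y_new, y_pred)

# A model that predicts the new data badly can give a negative value.
cv_r2_bad = cross_validation_r2(y_new, [18.0, 16.0, 14.0, 12.0, 10.0])
print(cv_r2, cv_r2_bad)
```

Unlike the ordinary value computed on the fitting sample, this quantity is not bounded below by zero, because the predictions were not chosen to minimize the squared error on the new data.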


III  Review of Past Research


There  is  a  considerable  amount  of  literature  examining  the  many 
efforts  that  have  been  made  to  determine  the  "best"  subset  of  independent 
variables  that  should  be  included  in  a  regression  model  so  that  the 
amount  of  unexplained  variance  in  the  dependent  variable  is  reduced. 

Many  criteria  for  selection  of  these  variable  subsets  have  been  examined, 
yet  no  one  best  criterion  has  been  found. 

Draper and Smith [10:163] point out two conflicting viewpoints
on this subject. At one extreme, all variables could be included in
the model for predictive purposes. However, though the values predicted
may be reliable, as the number of variables in the model approaches the
number of data points or observations, $R^2$ will naturally become close
to one, thus giving the inexperienced analyst a false sense of the
importance of the model.

At the other extreme, the model could include as few variables
as possible so that the predictions are still reliable and the costs of
maintaining and updating the data base are kept at a minimum. A
compromise between these two viewpoints is suggested and is considered
to be the "best" approach.

One would like to examine all of the $2^k$ possible regressions of
the dependent variable in the search for the best equation. However,
not only would there be computational and time limitations on the
computer which make this approach impractical, but there is the
remaining problem of specifically defining what is meant by the "best"
regression model and when it has been found. This chapter reviews
some of the research that has been done in this subject area.


Probably  the  most  well  known  research  on  the  subject  of  variable 
selection  and  regression  analysis  is  that  of  Draper  and  Smith.  Four 
different  regression  approaches  have  been  devised  including  all 
possible  regressions,  backward  elimination,  forward  selection,  and 
stepwise  regression. 

In the All Possible Regressions technique, all $2^k$ possible
regressions are considered. Thus a ten variable model would require
the examination of $2^{10}$ or 1024 possible regressions. The models are
ordered by some criterion such as $R^2$ or $\bar{R}^2$ and compared. Often for
large data bases, it becomes necessary to compute the residual mean
square error and assess its magnitude to determine the best cut-off
point for the total number of variables in the regression.
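
The enumeration can be sketched as follows for a small problem. The data are simulated with a fixed seed, the adjusted value is used as the ordering criterion, and the helper names are hypothetical; this brute-force loop is only practical for small numbers of variables.

```python
from itertools import combinations

import numpy as np

def sse_of_subset(X, y, cols):
    """SSE of the least squares fit of y on a constant plus the chosen columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def all_possible_regressions(X, y):
    """Return {subset: adjusted R^2} for every one of the 2^k subsets."""
    n, k = X.shape
    sst = float(((y - y.mean()) ** 2).sum())
    results = {}
    for size in range(k + 1):
        for cols in combinations(range(k), size):
            sse = sse_of_subset(X, y, cols)
            results[cols] = 1.0 - (sse / (n - size - 1)) / (sst / (n - 1))
    return results

# Simulated data: only columns 0 and 2 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=30)

results = all_possible_regressions(X, y)
best = max(results, key=results.get)
print(len(results), best)
```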

Recent research by analysts such as Schatzoff, Tsao, and Fienberg [31]
has reduced the number of calculations required from an
order of $k^3$ to $k^2$, thus making this technique more practical, yet still
relatively expensive to use. However, if the number of variables is first
reduced by methods such as the Chow test developed by Gregory Chow [7],
this method becomes even more practical.

In the backward elimination method, a regression equation containing
all possible variables is used as a starting point. A partial-F value
is calculated for each variable, and if the lowest value is less than some
specified tabular value, then that variable is removed from the model.
Once a variable is removed from the model, it is not given
further consideration. A new regression is then computed and the
process continues until no more variables can be eliminated from the
model. Although this method is not thought of as the most powerful
method to use in determining the best regression equation, Mantel [23]
supports the method and points out its many advantages.
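
The elimination loop can be sketched as follows. The threshold `f_out = 4.0` stands in for the tabular F value and is an assumption of this sketch, as are the function names and the simulated data.

```python
import numpy as np

def fit_sse(X, y, cols):
    """SSE of the least squares fit of y on a constant plus the listed columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def backward_elimination(X, y, f_out=4.0):
    """Drop the variable with the smallest partial F until all exceed f_out."""
    n, k = X.shape
    cols = list(range(k))
    while cols:
        sse_full = fit_sse(X, y, cols)
        mse = sse_full / (n - len(cols) - 1)
        # Partial F for variable j: increase in SSE when j is removed, over MSE.
        partial_f = {
            j: (fit_sse(X, y, [c for c in cols if c != j]) - sse_full) / mse
            for j in cols
        }
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_out:
            break                  # every remaining variable is retained
        cols.remove(weakest)       # removed variables are never reconsidered
    return cols

# Simulated data: only columns 0 and 2 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=30)
selected = backward_elimination(X, y)
print(selected)
```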

The forward selection process operates in the reverse manner from
the backward elimination procedure. Variables enter the model one at
a time until a satisfactory model has been obtained. Initially, partial-F
statistics and partial correlation coefficients are calculated between
each independent variable and the dependent variable. The variable
most highly correlated with the dependent variable enters the regression
equation. A new regression equation is then calculated and the process
continues. Once a variable has entered the regression equation, there is
no chance that it will be removed. This, however, is one of the method's
faults: there is no attempt to determine the effect an entering variable
has on the existing variables in the model.
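
A minimal sketch of forward selection follows. Candidates are ranked by their F-to-enter value, which orders them the same way as the partial correlation with the dependent variable; the entry threshold `f_in = 4.0`, the function names, and the simulated data are assumptions of this sketch.

```python
import numpy as np

def fit_sse(X, y, cols):
    """SSE of the least squares fit of y on a constant plus the listed columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_selection(X, y, f_in=4.0):
    """Add the variable with the largest F-to-enter until none exceeds f_in."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    sse_current = fit_sse(X, y, selected)
    while remaining:
        trials = {}
        for j in remaining:
            sse_new = fit_sse(X, y, selected + [j])
            df_resid = n - (len(selected) + 1) - 1
            f_enter = (sse_current - sse_new) / (sse_new / df_resid)
            trials[j] = (f_enter, sse_new)
        best = max(trials, key=lambda j: trials[j][0])
        f_best, sse_best = trials[best]
        if f_best < f_in:
            break
        selected.append(best)      # once entered, a variable is never removed
        remaining.remove(best)
        sse_current = sse_best
    return selected

# Simulated data: only columns 0 and 2 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=30)
order = forward_selection(X, y)
print(order)
```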

In the stepwise regression procedure, however, an examination is
made at each stage of inclusion of variables in the model to determine
whether any variable or set of variables introduced previously has lost
its significance due to the introduction of a new variable. Thus,
a variable which entered at an earlier stage, yet has been found
unimportant due to the inclusion of a new variable, will be detected
and removed from the model. For this reason, the stepwise procedure
has been judged to be the most powerful of these regression techniques.

In discussing various regression procedures, there are three
important points that need mentioning. The first point is that the
order of inclusion of the variables in the model is irrelevant. Thus,
the fact that a variable entered the model early does not mean that it is
more important than a variable which entered later. The second point
is that there is no guarantee that any of the previous methods will
arrive at the best regression model. The third point is that there is
also no guarantee that each of the previous methods will arrive at
the same model or subset of variables. This is true between any set
of regression procedures.

There are many more criteria for selecting variable subsets
other than $R^2$ or the partial-F statistic. The remaining portion
of this chapter is dedicated to mentioning those various research efforts.

Aitkin [1] discusses the use of the Mean Square Prediction Error
(MSPE) as a criterion for selecting variable subsets if the regression
equation is used for prediction purposes rather than description purposes.
In the latter case, he prefers the use of the conventional $R^2$ value as
a criterion. Allen [2] also discusses the use of the MSPE for selecting
variable subsets.

The MSPE is defined as the expected value of the squared difference
between the actual value of the dependent variable, $Y$, and the
estimated value, $\hat{Y}$. If all independent variables are used in the regression
equation, Aitkin defines the MSPE as follows:

$$MSPE = E[Y - \hat{Y}]^2 = \sigma^2 \left[ \frac{n+1}{n} + (x - \bar{x})' S_{xx}^{-1} (x - \bar{x}) \right] \tag{27}$$

where $x$ is a row vector of $X$, $\bar{x}$ is the vector of means, and $S_{xx}$ is the
matrix of cross products of the $k$ independent variables: $S_{xx} = X'X$.

Allen defines the MSPE as follows:

$$MSPE = E[Y - \hat{Y}]^2 = \sigma^2 + Var(\hat{Y}) + [E(\hat{Y}) - x\beta]^2 \tag{28}$$

where the last term is the squared bias of prediction and the last
two terms together are the Mean Square Error (MSE) of $\hat{Y}$.


Since the least squares predictor $\hat{Y}$ is unbiased, the squared bias
term is zero, and its variance is $x(X'X)^{-1}x'\sigma^2$. If the last term
is dropped, one gets:

$$MSPE_r = \sigma^2 + x(X'X)^{-1}x'\sigma^2 \tag{29}$$

which Allen uses for the comparison of other predictors.

Kennedy and Bancroft [22] discuss using the average value of the
MSPE over their sample as a criterion:

$$MSPE^* = \frac{1}{n} \sum_{i=1}^{n} \sigma^2 \left[ \frac{n+1}{n} + (X_i - \bar{x})' S_{xx}^{-1} (X_i - \bar{x}) \right] = \frac{\sigma^2}{n}(n + k + 1) \tag{30}$$

where $X$ has been assumed to follow a uniform distribution. Aitkin,
however, believed it more realistic to assume that all $X$ values were
independently and identically distributed. In either case, the
objective is to choose the variable subset which minimizes the MSPE.
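
The averaging in equation (30) can be checked numerically. The sketch below assumes that the cross product matrix is formed from the independent variables measured about their means, in which case the quadratic forms sum to $k$ over the sample and the average MSPE equals $\sigma^2 (n + k + 1)/n$. The function name, the simulated data, and the value of $\sigma^2$ are hypothetical.

```python
import numpy as np

def mspe_at(x, X, sigma2):
    """MSPE at the point x for a full model fitted to the rows of X.

    Assumes the cross product matrix is computed from deviations about the means.
    """
    n = X.shape[0]
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S_xx = Xc.T @ Xc                       # centered cross product matrix
    d = x - x_bar
    return sigma2 * ((n + 1) / n + d @ np.linalg.solve(S_xx, d))

rng = np.random.default_rng(1)
n, k, sigma2 = 20, 3, 2.0
X = rng.normal(size=(n, k))

# Average the MSPE over the sample points themselves.
average_mspe = float(np.mean([mspe_at(X[i], X, sigma2) for i in range(n)]))
closed_form = sigma2 * (n + k + 1) / n
print(average_mspe, closed_form)
```

The closed form shows that, for a fixed error variance, the average MSPE grows with the number of variables retained, which is why a smaller subset can predict better than the full model.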
If the subset of variables to be tested is specified in advance or
simply fixed, the testing hypothesis becomes:

$$\begin{aligned} H_0 &: MSPE - MSPE_1 \ge 0 \\ H_a &: MSPE - MSPE_1 < 0 \end{aligned} \tag{31}$$

where $MSPE_1$ is the MSPE of the variable subset. If the null hypothesis,
$H_0$, is not rejected, this means that the predictive performance of the
subset is not statistically different from that of the full set of
variables, and the subset may be considered for use in a prediction
equation. A noncentral F-statistic and test have also been developed by
Aitkin to test (31), depending on the assumed distribution and selection
process of the independent variables. In the cases where the variable
subsets are unknown, a simultaneous procedure, similar to the forward
selection process developed by Draper and Smith, was developed by
Garland [15]. In this procedure, variable subsets are chosen based on
a central-F approximation to the multiple correlation coefficient.

Helms [16] discusses the use of the Average Estimated Variance
(AEV) as a criterion for comparing competing linear models and explains
why the Integrated Mean Square Error (IMSE) used as a criterion is
not very useful in practice. The technique includes the computation
of the AEV for each possible regression and the implementation of a
stepwise procedure using the AEV as a criterion rather than $R^2$ or
Mallows' $C_p$ statistic. One advantage the AEV has over $R^2$ and $C_p$
is that it automatically incorporates information about the tradeoff
between bias and variance when one enters or deletes variables in the
model.

Furnival and Wilson [13] discuss a technique for computing the error
sum of squares (SSE) for all possible regressions with a minimal amount
of calculation, and show how it is implemented in a branch and
bound technique which they refer to as the Leaps and Bounds technique.
This technique is useful in determining the best subset without
examining all the possible subsets of variables.

The fundamental principle upon which their research is based is
that $SSE(A) \le SSE(B)$, where $A$ is any set of independent variables and
$B$ is a subset of $A$. In other words, it is impossible for any subset
of $A$ to have a lower error sum of squares than $A$. Because of this,
$SSE(A)$ can be used as a lower bound in the analysis: if $SSE(A)$ already
exceeds the SSE of the best subset of a given size found so far, all
subsets of $A$ can be ignored in the search for the best subset of that size.
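
The bounding principle can be illustrated with a simplified branch and bound search. This sketch is not Furnival and Wilson's Leaps and Bounds algorithm, which organizes the computations far more efficiently; it only shows how the inequality allows whole branches to be skipped. The function names and the simulated data are hypothetical.

```python
import numpy as np

def fit_sse(X, y, cols):
    """SSE of the least squares fit of y on a constant plus the listed columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subset_bb(X, y, m):
    """Find the m-variable subset with the smallest SSE, pruning by the bound."""
    k = X.shape[1]
    best = {"sse": float("inf"), "cols": None}
    visited = 0

    def search(cols, start):
        nonlocal visited
        visited += 1
        sse = fit_sse(X, y, cols)
        if sse >= best["sse"]:
            return                 # bound: no subset of cols can do better
        if len(cols) == m:
            best["sse"], best["cols"] = sse, tuple(cols)
            return
        # Branch: delete one more variable; `start` keeps each subset unique.
        for idx in range(start, len(cols)):
            search(cols[:idx] + cols[idx + 1:], idx)

    search(list(range(k)), 0)
    return best["cols"], best["sse"], visited

# Simulated data: only columns 1 and 3 actually influence y.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = 0.5 + 4.0 * X[:, 1] - 2.5 * X[:, 3] + rng.normal(scale=0.2, size=40)

cols, sse, visited = best_subset_bb(X, y, m=2)
print(cols, visited)
```

The pruning test is valid only because deleting variables can never reduce the error sum of squares, which is exactly the principle stated above.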

In their technique, two search variations are described: horizontal
and vertical. The horizontal variation arranges the regressions in a
probability tree form and in a conventional or natural order so
that all one variable regressions, two variable regressions, etc. are
easily observable. These regression trees are formed by beginning
with all $k$ variables in a regression and branching out on all possible
$k-1$ variable subsets. The value of SSE is computed for each of the
subsets and the subset with the smallest value will be the "best"
$k-1$ variable subset. That subset will not be divided further as it
provides a minimum value for that branch. Branching occurs elsewhere
in the same manner as above until the best possible $k-2$, $k-3$, ..., 1
variable subsets are chosen.

The criterion for selecting these variable subsets is based on either
R², adjusted R², or Mallows' Cp statistic. In a similar fashion, Narula
and Wellington [25] introduce a branch and bound algorithm that uses the
Minimum Sum of Weighted Absolute Errors (MSWAE) as the criterion for
selecting variable subsets. It involves the use of linear programming
to minimize the sum of the absolute values of the residuals subject
to several constraints.
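
As a rough illustration of the three least squares criteria named above, the sketch below computes R², adjusted R², and Mallows' Cp from sums of squares. The function names and the numbers in the example are hypothetical; p counts predictors and excludes the intercept, and the variance estimate for Cp is assumed to come from the full model.

```python
def r_squared(sse, tss):
    """Fraction of total variability explained by the fitted model."""
    return 1.0 - sse / tss

def adjusted_r_squared(sse, tss, n, p):
    """R-squared penalized for the number of predictors p (intercept excluded)."""
    return 1.0 - (sse / (n - p - 1)) / (tss / (n - 1))

def mallows_cp(sse_subset, sigma2_full, n, p):
    """Cp = SSE_p / s^2 - n + 2(p + 1), with s^2 estimated from the full model."""
    return sse_subset / sigma2_full - n + 2 * (p + 1)

# Hypothetical values: 63 observations, a 16-predictor subset.
n, p = 63, 16
sse_subset, tss = 10.0, 100.0
r2 = r_squared(sse_subset, tss)
adj_r2 = adjusted_r_squared(sse_subset, tss, n, p)

# For the full model itself, Cp reduces to the number of fitted parameters.
p_full, sse_full = 20, 8.0
cp_full = mallows_cp(sse_full, sse_full / (n - p_full - 1), n, p_full)
```

Adjusted R² never exceeds R², and a subset whose Cp is close to p + 1 shows little evidence of bias from the omitted variables.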

Andrews [4] discusses the use of regression and model building
by medians and also introduces a robust method of analyzing data
assumed not to have a Gaussian distribution with errors of equal
variance.

Webster, Gunst, and Mason [37] discuss a modified least squares
estimation procedure using latent roots and latent vectors of the
correlation matrix of the dependent and independent variables. This
has been found to be very useful when the matrix of independent
variables is nearly singular.

In a more recent article, Park [29] discusses a strategy for
selecting subsets of variables from a given linear mixture model
developed by Scheffe [32], and applies the MSE as a criterion for
screening the variables for model reduction.

In  another  recent  article,  Ellerton  [11]  investigates  a  method 
of  applying  linear  programming  to  determine  whether  a  given  subset 
of  variables  is  adequate  in  a  regression  model. 

Surprisingly, very little cross-communication has taken place
on this very important subject. I believe a joint analytical effort
should be made to test these various criteria against various data
bases in order to determine whether there is one best method or
criterion for selecting the variable subsets to be used in a
regression model.

IV Model Development and Selection of Variables


The  Westinghouse  Data  Base 

Senior engineers from Westinghouse collected most of the data in
both Phase I and Phase II from on-site visits to the Pentagon,
AFLC Headquarters, ATC Headquarters, four Air Logistic Centers (ALCs),
and several Air Force bases. While on site, they interviewed
technicians to verify the appropriateness of the LRUs originally
selected and to identify possible alternatives.

At the completion of the Phase II data collection, the resulting
data base contained 134 LRUs (see Appendix A) and thirty-three elements
(variables plus indicators per LRU) (see Table II). After various
variable transformations and modifications, twenty variables remained.

The first set of variables describes the aircraft type and avionics
area and consists of indicators (zero or one). Three aircraft types
(fighter, bomber and cargo) and three avionics areas (sensory,
communication and navigation) were initially coded as follows:

    Bomber           1  0
    Cargo            0  1
    Fighter          0  0

    Sensory          1  0
    Communication    0  1
    Navigation       0  0

TABLE II

Bomber indicator variable (1 indicates Bomber aircraft)
Cargo indicator variable (1 indicates Cargo aircraft)
Sensory indicator variable (1 indicates sensory avionics)
Communications indicator variable (1 indicates comm avionics)
Unit Price
Volume (in^3)
Weight (lbs)
Component Count
Percentage Digital Components
Percentage Analog Components
Percentage Electro-Mechanical Components
Percentage Power Supply Components
Percentage Transmitter Components
Percentage Solid State Components
Power Dissipation (watts)
Utilization Factor (Operating hours/flying hour)
Percentage Failures Detected by Automatic Test (BIT/FIT FACTOR)
Number of Integrated Circuits
Number of SRUs in the LRU
Mean Time (flight hours) Between Failures
Mean Time (flight hours) Between Maintenance Actions
Maintenance Manhours - Scheduled (Organizational)
Maintenance Manhours - Unscheduled (Organizational)
Maintenance Manhours - Shop (Intermediate)
Logistic Support Cost - Field
Logistic Support Cost - Special Repair Center (Depot)
Logistic Support Cost - Packaging and Transportation
Logistic Support Cost - Condemnation Replenishments
Training Costs
Percentage LRUs Not Repairable This Station (%NRTS)
Flying Hours (FH) (to normalize MMH and LSC)
Percentage Condemned LRUs
Specialized Repair Activity (Depot) Costs
Quantity per Assembly
Flying hours (to normalize Training costs)

After additional investigation, the following set of indicator
variables was used in the regression analysis to denote interactions
between aircraft type and avionics area:

LRUs  in  fighter  aircraft  navigation  systems 
LRUs  in  fighter  aircraft  sensory  systems 
LRUs  in  fighter  aircraft  communication  systems 
LRUs  in  bomber  aircraft  navigation  systems 
LRUs  in  bomber  aircraft  sensory  systems 
LRUs  in  bomber  aircraft  communication  systems 
LRUs  in  cargo  aircraft  navigation  systems 
LRUs  in  cargo  aircraft  communication  systems 
LRUs  in  cargo  aircraft  sensory  systems  were  not  included.  The  above 
set  of  indicators  is  coded  as  follows: 

    Fighter-Navigation       1000000
    Bomber-Navigation        0100000
    Cargo-Navigation         0010000
    Fighter-Sensory          0001000
    Bomber-Sensory           0000100
    Fighter-Communication    0000010
    Bomber-Communication     0000001
    Cargo-Communication      0000000
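
The coding scheme can be sketched in a few lines. This is an illustrative reconstruction only: the function name is hypothetical, the order of the seven indicators follows variables 1 through 7 of Table VIII, and the seventh position is assumed to mark Bomber-Communication (BOMCOM in Table III). Cargo-Communication is the baseline category with all indicators equal to zero.

```python
# Indicator order assumed from Table VIII (FGTNAV, BOMNAV, CARNAV, FGTSEN,
# BOMSEN, FGTCOM, BOMCOM); Cargo-Communication is the all-zero baseline.
INDICATOR_ORDER = [
    ("Fighter", "Navigation"),
    ("Bomber", "Navigation"),
    ("Cargo", "Navigation"),
    ("Fighter", "Sensory"),
    ("Bomber", "Sensory"),
    ("Fighter", "Communication"),
    ("Bomber", "Communication"),
]
BASELINE = ("Cargo", "Communication")

def encode(aircraft, area):
    """Return the seven 0/1 indicators for one aircraft type and avionics area."""
    key = (aircraft, area)
    if key == BASELINE:
        return [0] * len(INDICATOR_ORDER)
    if key not in INDICATOR_ORDER:
        # Cargo-Sensory LRUs were not included in the data base.
        raise ValueError("combination not present in the data base: %s-%s" % key)
    return [1 if key == combo else 0 for combo in INDICATOR_ORDER]

codes = {combo: encode(*combo) for combo in INDICATOR_ORDER + [BASELINE]}
```

Using seven indicators for eight categories avoids the exact linear dependence with the intercept that eight indicators would create.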

The next four independent variables are measures of physical
characteristics. The Unit Price is measured in 1976 dollars per LRU
and ranges in value from $153 to $220,943. The Volume is measured
in cubic inches and ranges in value from 30 to 8200. The Weight
is measured in pounds and ranges in value from one pound to 8200 pounds.
Component Count is the number of electronic components and ranges in
value from zero to 7638.

The next five independent variables are categories of the
different component types, including Digital, Analog, Electromechanical,
Power Supplies, and Transmitter, and are measured as a percentage of
the total number of components having that characteristic. All
values range from zero to 100 percent.

The next independent variables, Fraction Solid State and the
number of Integrated Circuits in each LRU, are measures of LRU
technology, the latter ranging in value from zero to 4625.

The sixteenth independent variable is a measurement of the Power
Dissipation, which is defined as the input power minus the transmit
power and ranges in value from six to 1640 watts.

The next independent variable represents the percentage of failures
in LRUs detected by the Built-In-Test/Fault-Isolation-Test (BIT/FIT).

The last two independent variables are the Specialized Repair Activity
(Depot) Costs and the Quantity Per Assembly.

Westinghouse also identified several dependent variables. These
include the Mean Time Between Failures (MTBF), the Mean Time Between
Maintenance Actions (MTBMA), the Total Maintenance Man Hours per
Operating Hour (MMH-UNS/OH), the Maintenance Man Hours in the Shop
per Operating Hour (MMH-SHOP/OH), the Total Logistic Support Costs
per Operating Hour (LSC-TOT/OH), the Field Logistic Support Cost per
Operating Hour (LSC-FLD/OH), the Training Costs per Operating Hour
(TRAIN/OH), and the percentage of LRUs not repairable this station
(NRTS).

Only one of the dependent variables mentioned above will be used
in the analysis: LSC-TOT/OH. A list of all the variables used in
this report and previous reports is contained in Table III.

Previous  Models 

In this section, five previous models (two developed by
Westinghouse and three developed by Pulcher) are discussed.

The first Westinghouse model (Table IV) was based on the Phase I
data and the second (Table V) was based on the Phase II data. All
variables in the first model are in linear form, quadratic form or
logarithmic form.

The three models developed by Pulcher are described in Table VI
and Table VII. Initially, Pulcher was able to create ninety-seven
variables from the Product of Powers model of the form:

    \ln Y = \alpha_0 + \sum_{i=1}^{13} \alpha_i D_i + \sum_{j=1}^{6} \beta_{j0} \ln X_j + \sum_{j=1}^{6} \sum_{i=1}^{13} \beta_{ji} D_i \ln X_j    (31)

The D_i are indicator variables, and their function is to allow
for coefficients to be different for subpopulations. For a simplified
example, suppose we had:

    \ln Y = \alpha_0 + \alpha_1 D_1 + \beta_{10} \ln X_1 + \beta_{11} D_1 \ln X_1    (32)

For the subpopulation for which D_1 = 0, the model is:

    \ln Y = \alpha_0 + \beta_{10} \ln X_1    (33A)

while for the subpopulation for which D_1 = 1, the model is:

    \ln Y = (\alpha_0 + \alpha_1) + (\beta_{10} + \beta_{11}) \ln X_1    (33B)
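
The simplified example can be traced in code. The sketch below is illustrative only: the coefficient values are invented, and the function names are hypothetical. It evaluates equation (32) directly and shows that each subpopulation follows its own straight line in ln X_1, as in (33A) and (33B).

```python
import math

def ln_y(x1, d1, a0, a1, b10, b11):
    """Equation (32): ln Y for one observation with indicator d1 (0 or 1)."""
    return a0 + a1 * d1 + b10 * math.log(x1) + b11 * d1 * math.log(x1)

def subpopulation_line(d1, a0, a1, b10, b11):
    """Intercept and slope (in ln X1) implied by (32) for the subpopulation d1."""
    return (a0 + a1 * d1, b10 + b11 * d1)

# Hypothetical coefficients, chosen only to illustrate the algebra.
a0, a1, b10, b11 = -5.0, 1.2, 0.4, -0.3

line_0 = subpopulation_line(0, a0, a1, b10, b11)  # equation (33A)
line_1 = subpopulation_line(1, a0, a1, b10, b11)  # equation (33B)
```

With 13 indicators and 6 continuous variables, the same construction yields 13 + 6 + 13 x 6 = 97 candidate terms, which is the count quoted for the Product of Powers model.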

Since  there  were  only  63  data  points,  a  method  was  needed  to  reduce 
the  number  of  variables.  Pulcher  chose  the  Chow  Test  (also  called  the 
Test  of  Equality  Between  Subsets  of  Coefficients  in  Two  Regressions) , 

TABLE III

List of Variables - Abbreviations

Name                                    Westinghouse   Pulcher   This Report
Bomber                                  IBOM           *         BOMBER
Cargo                                   ICAR           *         CARGO
Sensory                                 ISEN           *         SENSORY
Communication                           ICOM           *         COMM
Navigation - Fighter                    *              *         FGTNAV
Navigation - Bomber                     IBMNAV         *         BOMNAV
Navigation - Cargo                      *              *         CARNAV
Sensory - Fighter                       *              SF        FGTSEN
Sensory - Bomber                        *              SB        BOMSEN
Communication - Fighter                 IFGCOM         CF        FGTCOM
Communication - Bomber                  IBMCOM         CB        BOMCOM
Communication - Cargo                   *              COMMC     CARCOM
Unit Price                              UP             UP        UP
Volume                                  V              V         V
Weight                                  W              W         W
Component Count                         CC             CC        CC
Component Density                       CD             *         *
Power Dissipation                       PD             PD        PD
Fraction Solid State                    FSS            % SS      SS
Fraction Digital                        FDI            % DIG     DIG
Fraction Analog                         FAN            % AN      AN
Fraction Electromechanical              FEM            % EM      EM
Fraction Power Supply                   FPS            % PS      PS
Fraction Transmitter                    FXR            % XMTR    XMTR
Fraction BIT/FIT                        BIT/FIT        BF        BITFIT
Number of Integrated Circuits           IC             *         IC
Specialized Repair Activity Costs       SRA            *         SRU
Quantity Per Assembly                   QPA            *         QPA
Logistic Support Cost/Operating Hour    LSC/OH         LSC/OH    LSC/OH
Maintenance Manhours/Operating Hour     MMH/OH         *         *
Mean Time Between Failures              MTBF           *         *
Mean Time Between Maintenance Actions   MTBMA          *         *
Training Cost/Operating Hour            TRAIN/OH       *         *
Not Repairable This Station             NRTS           *         *

* Not used in the analysis

TABLE IV

Westinghouse Model - Phase I Data

    \ln(LSC/OH) = b_0 + \sum_{i=1}^{21} b_i X_i

Adjusted R² = .8916    R² = .9283    F-value = 25..

 i    b_i                  X_i                                   Partial-F
 0    -8.15108
 1     3.86111             (IBOM-.2857142857)                    36.0
 2     3.66533             (ICAR-.2698412698)                    31.4
 3    -4.85271 x 10^-1     (ISEN-.2539682540)                     3.6
 4    -2.56663             (IBOM-.2857142857)(ISEN-.2539682540)  37.2
 5    -1.66262             (IBOM-.2857142857)(ICOM-.206349206)   12.2
 6    -7.67253 x 10^-1     (ICAR-.2698412698)(ICOM-.206349206)    3.2
 7     1.27356 x 10^-2     FPS                                    6.8
 8     2.25967 x 10^-2     (FAN-63.349)                          36.0
 9    -7.42999 x 10^-3     (FSS-61.138)                           9.0
10     2.38503             (UF-1.639)                            27.0
11    -9.20384 x 10^-11    (UP-133606.3)^2                       25.0
12    -1.52864 x 10^-4     (W-64.314)^2                           8.4
13    -1.07105 x 10^-3     (FAN-48.895)^2                        33.6
14     1.20418 x 10^-3     (FEM-46.991)^2                        33.6
15     7.10025 x 10^-4     (FXR-40.172)^2                        10.9
16    -1.61651 x 10^-4     (FSS-51.898)^2                         2.2
17    -1.11568 x 10^-6     (PD-722.249)^2                         7.3
18     5.009996            (UF-1.681)^2                          42.2
19     1.70042 x 10^-3     (BF-27.288)^2                         13.0
20     4.60293 x 10^-1     ln(UP)                                31.4
21     2.35583 x 10^-1     ln(V)                                  4.8

TABLE V

Westinghouse Model - Phase II Data

    \ln(LSC/OH) = b_0 + \sum_{i=1}^{18} b_i X_i

R² = .8827    F-value = 41.0

 i    b_i                  X_i             Partial-F
 0    -6.97950
 1     7.85143 x 10^-1     IFGCOM          10.24
 2     1.14876             IBMNAV          34.81
 3     1.07719             IBMCOM          21.16
 4     1.91500 x 10^-1     CD              12.25
 5    -1.22007 x 10^-2     FDI             37.21
 6    -1.72307 x 10^-2     FEM             24.01
 7    -9.49029 x 10^-3     FXR              4.84
 8    -8.36154 x 10^-3     FSS              9.61
 9    -3.35635 x 10^-4     (V-1333.0)       9.00
10     1.98641 x 10^-2     (W-32.3)        17.64
11     6.72953 x 10^-8     (V-3222.0)^2     6.25
12    -1.05350 x 10^-4     (W-65.3)^2       4.00
13    -4.24991 x 10^-8     (CC-2986)^2      5.76
14    -4.36525 x 10^-4     (FPS-45.48)^2    9.61
15     7.79704 x 10^-1     (UF-1.72)^2     16.81
16     5.64131 x 10^-1     ln(UP)          94.09
17     4.61602 x 10^-1     ln(V)            8.41
18     1.47264 x 10^-1     ln(PD)           6.25

TABLE VI

Pulcher's SPSS Model - Phase I Data

    \ln(LSC/OH) = \alpha_0 + \sum_i \alpha_i D_i + \sum_j \beta_{j0} \ln X_j + \sum_j \sum_i \beta_{ji} D_i \ln X_j

R² = 0.95212    Adjusted R² = 0.92388    F = 33.72

Variable No.    Coefficient    Partial F
 1               6.402702      13.63
 3               0.084548       0.10
 5               0.412407      37.28
 8              11.320694      23.80
10              -1.135445      17.68
11              -1.457859      26.48
14               3.710527       7.25
16              -2.950970       9.44
17              -0.092716       0.09
20               0.322015       0.07
23              -0.568085      27.14
26              -0.729848       7.51
27              -1.803242       9.46
28               2.506829      12.27
63              -1.995969      18.20
64               3.034970      17.51
68              -0.272142       7.44
70              -0.758240       8.11
75               0.294839      25.70
90              -0.456146      24.86
94               0.697895      25.90
96              -0.642736      43.88
Constant        -5.315378      79.01

TABLE VII

Pulcher's Leaps and Bounds Models - Phase I Data

    \ln(LSC/OH) = \alpha_0 + \sum_i \alpha_i D_i + \sum_j \beta_{j0} \ln X_j + \sum_j \sum_i \beta_{ji} D_i \ln X_j

                Cp Criterion                  R² Criterion
                R² = 0.9135                   R² = 0.9323
                Adjusted R² = 0.88347         Adjusted R² = 0.9001
                F = 31.21                     F = 29.25

Variable        Coefficient    Partial-F      Coefficient    Partial-F
UP               0.245908       8.78           0.313871      14.52
W                0.384075       7.75           0.350494       6.86
SF              -1.061926      12.78          -2.878942      14.29
SB              -1.822390      30.26          -2.195891      39.06
DIG                                            4.381530       4.88
NF*W            -0.431742      31.61          -0.343076       2.10
NF*CC           -0.466254      13.70          -0.470354      15.84
NF*PD            0.738901      16.62           0.672722      14.59
NC*UP            0.285409       5.13           0.254284       4.04
NC*V            -0.334677       4.93          -0.292486       3.92
SF*CC                                          0.293229       6.30
DIG*UP          -0.584870      12.86          -0.950128      11.70
DIG*V                                         -0.971576       2.25
DIG*W                                          2.676919       4.93
DIG*PD           1.081951      15.97           0.553008       2.59
AN*W             0.309271      16.60           0.239272       9.98
EM*W             0.698175      13.89           0.705835      13.47
EM*PD           -0.555855      21.58          -0.545678      20.61
BF*W             0.866668      28.67           0.828916      27.04
BF*%SS          -0.701034      37.03          -0.706378      38.19
Constant        -3.855040      53.44          -4.091618      64.16

All other coefficients are zero.

which prescreens the variables and eliminates those which are unimportant.
The Chow Test also determines which subpopulations actually have different
coefficients. Sixty variables remained and were used in conjunction
with the three models.
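
For readers unfamiliar with the Chow Test, the following minimal sketch shows the usual form of the statistic for two subpopulations and a simple (one-predictor) regression. It is not Pulcher's procedure; the data and function names are hypothetical. The statistic compares the pooled SSE with the sum of the two separate SSEs, using k fitted parameters per regression.

```python
def simple_sse(x, y):
    """SSE of the least squares line y = a + b x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

def chow_f(x1, y1, x2, y2, k=2):
    """Chow F statistic with k parameters per regression (intercept and slope)."""
    sse_pooled = simple_sse(x1 + x2, y1 + y2)
    sse_separate = simple_sse(x1, y1) + simple_sse(x2, y2)
    df_denominator = len(x1) + len(x2) - 2 * k
    return ((sse_pooled - sse_separate) / k) / (sse_separate / df_denominator)

# Hypothetical subpopulations with clearly different intercepts and slopes.
x_a = [1, 2, 3, 4, 5]
y_a = [3.1, 4.9, 7.2, 8.8, 11.1]
x_b = [1, 2, 3, 4, 5]
y_b = [5.4, 6.1, 6.4, 7.1, 7.5]

f_stat = chow_f(x_a, y_a, x_b, y_b)
```

The statistic is compared with an F distribution on (k, n1 + n2 - 2k) degrees of freedom; a large value indicates that the two subpopulations need separate coefficients.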

A stepwise regression procedure in SPSS was used to develop
the first model, and the Leaps and Bounds algorithm was used to create the
second and third models, the second using R² as the criterion for selection
and the third using Mallows' Cp statistic as the criterion for selection.

All three of these models did a very good job of predicting the old
data, as determined by the R² value. In his final conclusion, however,
prediction intervals were computed using the Omnitab computer package [20],
and it was determined that both the Leaps and Bounds Cp model and the Leaps
and Bounds R² model did a better job of prediction than the SPSS model.

Automatic Interaction Detection

It has been suggested that another method of prescreening variables
prior to regression is the Automatic Interaction Detection (AID) computer
package developed at the University of Michigan's Institute for Social
Research and documented by Sonquist and Morgan [33,34]. This technique
is primarily used in constructing models on sociological or categorical
data and involves a single interval-scaled criterion variable and a
mixture of interval, ordinal, and nominally scaled predictor variables.

A typical problem in regression analysis is that one cannot always
know in advance which transformations, such as X_i^2 or ln(X_i), or which
interaction terms, such as X_i X_j, to introduce in the model so that the
predictive power of the model is maximized. A larger error term reported in
much of today's research may be partly due to the way in which these predictor
variables are combined in the model, and it is this problem of locating
specific interaction effects between variables, if in fact they do
exist, that is the basis for this investigation. Since AID also
determines the variables most important to the model, its main purpose
in this investigation will be as a screening device to locate those
variables most important to the regression model, thus reducing the
number of possible variables considerably.

AID  Algorithm  and  Objective 

The AID analysis is somewhat of a branch and bound procedure that uses
analysis of variance techniques. It is useful in studying the
interrelationships among a set of variables and in maximizing
the predictive power of a multiple regression model. Unlike most
multiple regression procedures, the AID analysis does not require
linearity and additivity assumptions.

The AID algorithm accomplishes a sequential division of the entire
data into subsets based on that split which causes the greatest
reduction in the unexplained variability of the criterion variable.
On the first iteration, the entire data base is split into two groups
around that variable which allows for the minimum within-group
variability measured by the sum of squared deviations of the criterion
variable from the group means. On each successive iteration, one of
the existing groups is split in the same manner as in the first step.
This process continues until one of the stopping criteria has been
satisfied.

The AID model can be written as:

    Y_{mi} = \mu_i + e_{mi},    m = 1, 2, \ldots, n_i;  i = 1, 2, \ldots, g    (34)

where: Y_{mi} is the m-th criterion variable observation in group i
       \mu_i is the i-th group mean
       e_{mi} is the random error of the m-th criterion variable
              observation in group i

This random error term has the same assumptions as the random error term
e_i which was discussed in Chapter II.

An estimate for \mu_i is \bar{Y}_i, the sample mean of the observations
in group i. Letting \bar{Y} be the sample mean for the criterion variable,
the total variability in the criterion variable (in AID notation) can be
stated as follows:

    TSS_T = \sum_{i=1}^{g} \sum_{m=1}^{n_i} (Y_{mi} - \bar{Y})^2    (35)

This value will be constant for any given set of n observations.

Equation (35) can be expanded to:

    \sum_{i=1}^{g} \sum_{m=1}^{n_i} (Y_{mi} - \bar{Y})^2 = \sum_{i=1}^{g} \sum_{m=1}^{n_i} (Y_{mi} - \bar{Y}_i)^2 + \sum_{i=1}^{g} \sum_{m=1}^{n_i} (\bar{Y}_i - \bar{Y})^2    (36)

or: TSS_T = WSS + BSS

where: TSS_T is the total sum-of-squares for the entire sample
       WSS is the within-group sum-of-squares
       BSS is the between-group sum-of-squares

The last term can be simplified to:

    BSS = \sum_{i=1}^{g} n_i (\bar{Y}_i - \bar{Y})^2    (37)
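
The decomposition in equations (35) through (37) can be verified on a small example. The sketch below uses hypothetical group data and a hypothetical function name; it computes the three sums of squares directly from their definitions.

```python
def decompose(groups):
    """Return (TSS_T, WSS, BSS) for a list of groups of criterion values."""
    all_y = [y for g in groups for y in g]
    grand_mean = sum(all_y) / len(all_y)
    tss = sum((y - grand_mean) ** 2 for y in all_y)            # equation (35)
    wss = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    bss = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2     # equation (37)
              for g in groups)
    return tss, wss, bss

# Hypothetical criterion values already divided into g = 3 groups.
groups = [[2.0, 3.0, 4.0], [10.0, 12.0], [20.0, 21.0, 22.0, 25.0]]
tss, wss, bss = decompose(groups)
```

Because TSS_T is fixed for a given sample, any regrouping that raises BSS lowers WSS by the same amount.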


The objective of the AID algorithm at each iterative step is to
split the groups so that BSS is as large as possible, thus making WSS
as small as possible. A good measure of the goodness of the resulting
model is:

    R^2 = BSS^* / TSS_T,    0 \le R^2 \le 1    (38)

where BSS* is the BSS of the existing groups. As in the multiple
regression case, the R² value indicates the fraction of the variability
in the criterion variable explained by the regression equation. In
AID, an R² value close to one indicates that the splitting process has
done a good job of grouping observations with nearly identical values
of the criterion variable.

At each split, equation (36) can be written as:

    TSS_i = WSS_i + BSS_i    (39)

Using this notation, the AID algorithm at each iteration can be
generalized as follows:

(1) Select that unsplit sample group which has the largest total
sum-of-squares around its own mean as a candidate for further splitting.

(2) For each predictor variable, find the partition of the observations
in the group selected in Step 1 which maximizes BSS_i (or minimizes WSS_i).

(3) Choose the best partition of observations on a predictor and
split the group using that predictor variable.

(4) Repeat from Step 1 until a stopping criterion has been satisfied.

The logic of the AID algorithm can be easily summarized in a flow
diagram developed by Gooch [14] and simplified by McNichols [25] in
Figure 2.


    1. Select subgroup with largest TSS_i.
         a. Check for minimum group size.
         b. Check limits on minimum TSS_i.

    2. Further splitting possible?
         Yes: continue with step 3.
         No:  go to step 5.

    3. For each predictor variable:
         a. Find the criterion mean for each predictor value.
         b. If nominal variable, sort predictor values by criterion mean.
         c. Find BSS values for splits between adjacent predictor values.
         d. Select best split (Max BSS) for this predictor.

    4. Select best split over all predictors. Perform split if resulting
       groups are large enough. Output iteration results. Return to step 1.

    5. Print Split Summary and AID trees.

    [STOP]

FIGURE 2 Logic of the AID Algorithm
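
The core of one iteration, finding the best binary split of a group, can be sketched as follows. This is a simplified illustration, not the AID package: it assumes ordered (already recoded) predictor values, omits the sorting of nominal categories by criterion mean, and uses hypothetical names and data.

```python
def best_split(x, y, nmin=1):
    """Best binary split 'x <= cut' versus 'x > cut' by between-group sum of squares.

    Returns (bss, cut); cut is None when no admissible split exists.
    """
    grand_mean = sum(y) / len(y)
    best = (0.0, None)
    for cut in sorted(set(x))[:-1]:          # splits between adjacent predictor values
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        if len(left) < nmin or len(right) < nmin:
            continue                         # resulting groups too small
        bss = (len(left) * (sum(left) / len(left) - grand_mean) ** 2
               + len(right) * (sum(right) / len(right) - grand_mean) ** 2)
        if bss > best[0]:
            best = (bss, cut)
    return best

def best_predictor(predictors, y, nmin=1):
    """Pick the predictor whose best split gives the largest BSS."""
    results = {name: best_split(x, y, nmin) for name, x in predictors.items()}
    name = max(results, key=lambda k: results[k][0])
    return name, results[name]

# Hypothetical recoded predictors and criterion values for one group.
predictors = {"V": [0, 0, 1, 1, 2, 2], "W": [0, 1, 0, 1, 0, 1]}
criterion = [1.0, 2.0, 1.0, 2.0, 9.0, 10.0]
chosen, (chosen_bss, chosen_cut) = best_predictor(predictors, criterion)
```

In this example the split is made on V between recode values 1 and 2, since that partition separates the two large criterion values from the rest.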

Stopping Criteria

There are four important stopping criteria used in the AID algorithm,
which are specified by the user.

(1) The maximum number of final groups, including those which can
and cannot be further split, cannot exceed the value MAXGP or termination
will occur.

(2) The number of observations in each group that is split cannot
be less than the value NMIN.

(3) The total sum of squares in a group, TSS_i, cannot be less
than P1 percent of the total sum of squares for the entire sample, TSS_T.
Numerically speaking, P1 < TSS_i/TSS_T.

(4) Any split must reduce the original within-group sum of squares
by P2 percent or the AID algorithm is terminated.

Gooch suggests that:

    P1 > .01
    P2 ≥ .005
    MAXGP ≤ 90
    NMIN ≥ 5% of the total number of observations
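
The four criteria can be expressed as two simple checks. The sketch below is one possible reading, with hypothetical function names: P1 and P2 are treated as fractions, and the reduction required by criterion (4) is assumed to be measured as BSS_i relative to TSS_T, which the text does not state explicitly.

```python
def group_is_candidate(tss_i, tss_total, n_groups, p1, maxgp):
    """Criteria (1) and (3): room for another group, and enough variability left."""
    return n_groups < maxgp and tss_i / tss_total > p1

def split_is_acceptable(n_left, n_right, bss_i, tss_total, nmin, p2):
    """Criteria (2) and (4): both new groups hold at least NMIN observations,
    and the split explains at least the fraction P2 of TSS_T (an assumption)."""
    return min(n_left, n_right) >= nmin and bss_i / tss_total >= p2

# Hypothetical settings in the range suggested by Gooch.
P1, P2, MAXGP, NMIN = 0.015, 0.005, 30, 4

candidate = group_is_candidate(tss_i=40.0, tss_total=1000.0, n_groups=6,
                               p1=P1, maxgp=MAXGP)
accepted = split_is_acceptable(n_left=5, n_right=3, bss_i=12.0, tss_total=1000.0,
                               nmin=NMIN, p2=P2)
```

Here the group qualifies for splitting, but the proposed split is rejected because one of the resulting groups would hold fewer than NMIN observations.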

Analysis of the AID Output

One of the main features of the AID package is the tree diagram,
which graphically describes the splitting process of each of the groups.
The structure of these trees is very important in determining the nature
of the variable interactions in the model.

Sonquist and Morgan describe two basic structures or shapes of
the trees: the trunk-twig structure and the trunk-branch structure.
The trunk-twig structure allows only one of the two groups produced by
a split to be split again. The group that is not split is classified
as a final group.

There are three basic types of trunk-twig structured trees: top
termination, bottom termination, and alternating termination (see
Figure 3). The top termination structure is referred to by Sonquist
as an "alternative advantage" model, where the nature of the advantage
is determined by the characteristic which split the group. In this
structure, those groups in the upper branches always have a higher
mean value than the lower branches, and once formed, these upper branches
cannot be split any further.


FIGURE 3 Trunk-Twig Structured AID Trees: (a) Top Termination, (b) Bottom Termination, (c) Alternating Termination

Sonquist refers to the bottom-termination structured tree as an
"alternative disadvantage" tree, where the nature of the disadvantage is
determined by the characteristic which split the group. In this case,
the lower branches, once formed, cannot be split further.

In the alternating termination structure, the interpretation can
be viewed as a combination of the two preceding structures whereby the
importance of a split depends solely on the characteristics of the
variable which split the group.

The trunk-branch structure is analogous to the trunk-twig structure
except that each group produced by a split is a candidate for further
splitting. This type of tree structure is typical of the first few splits
in any AID tree. Once the first few splits on a group have been made, the
structure usually takes on the form of the trunk-twig structure.

Besides the structure of the tree, the symmetry of the tree, or
lack thereof, is also important; symmetry here refers to the extent to
which the same variables appear in splits on various trunks. Non-symmetry
implies that an interaction exists. Also, if a variable is split on one
trunk and shows no indication of reducing the unexplained variation in
another trunk, then there is clear evidence of an interaction effect between
that variable and those used in the preceding splits. The predictive
power of each variable in a group is evaluated by the statistic BSS_i/TSS_i
and is shown on the selected AID output in Appendix D. This statistic
represents the proportion of the variation in the group to which the
predictor variable is being applied that would be explained if that
group were split.

Preparation for the Use of AID

In order to use the AID computer package, several important steps
had to be followed. First, the data had to be transformed so that
an integer format could be used to describe each data element in a
six-place field. Since many variables were calculated to as many as 13
decimal places, those variables had to be multiplied or divided by a
specified power of 10 and then truncated. For example, LSC/OH was
multiplied by 10^4 and then truncated, so LSC/OH(27) = 26.63122286176
became 266312.
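
The scaling step can be reproduced as follows. This is a small sketch with a hypothetical function name; the six-place limit is taken from the text, and the power of 10 is chosen per variable.

```python
def to_aid_integer(value, power):
    """Scale by 10**power, truncate toward zero, and check the six-place field."""
    scaled = int(value * 10 ** power)   # int() truncates rather than rounds
    if abs(scaled) > 999999:
        raise ValueError("value does not fit in a six-place field")
    return scaled

lsc_27 = to_aid_integer(26.63122286176, 4)
```

For the value quoted above, the result is 266312, so the digits beyond the fourth decimal place are lost.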

It is possible that reducing the number of significant places in this
way introduces round-off errors and non-comparable values.

Secondly, all data points for each variable had to be sequentially
ordered and placed into groups or categories of equal size (see Table VIII).
This is done so that when the groups are split by AID, each mean will be
stable with respect to the elements in that group.
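
One way to produce such a recode is to rank the observations and cut the ranking into equal parts. The sketch below is illustrative only, with a hypothetical function name and data; unlike the value ranges listed in Table VIII, it assigns codes by rank, so tied values could fall into different categories.

```python
def equal_size_recode(values, n_groups):
    """Assign codes 0..n_groups-1 so each category holds about the same number of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    codes = [0] * len(values)
    for rank, i in enumerate(order):
        codes[i] = rank * n_groups // len(values)
    return codes

# Hypothetical values for one variable, recoded into three categories.
volume = [10, 50, 20, 40, 30, 60]
volume_codes = equal_size_recode(volume, 3)
```

Each code then stands for a contiguous range of the original variable, which is the form AID expects for an ordered predictor.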

After  the  data  is  transformed  to  the  proper  form,  the  computer 
deck  can  be  formed.  The  itemized  input  is  described  in  Appendix  C. 

Results 

As stated earlier, the important parameters in the AID input are
P1, P2, NMIN, and MAXGP. Many attempts with various combinations of
these parameters were made and are described in Table IX.

In  the  first  four  runs  NMIN  was  set  to  4,  which  means  that  no 
groups  will  be  split  unless  there  are  at  least  8  data  points  in  that 
group  (4  for  each  subgroup  split) . 

TABLE VIII

Sequential Ordering of Variables

Variable          No.   Recode   Range
FGTNAV             1    0        LESS THAN 1
                        1        1 OR OVER
BOMNAV             2    0        LESS THAN 1
                        1        1 OR OVER
CARNAV             3    0        LESS THAN 1
                        1        1 OR OVER
FGTSEN             4    0        LESS THAN 1
                        1        1 OR OVER
BOMSEN             5    0        LESS THAN 1
                        1        1 OR OVER
FGTCOM             6    0        LESS THAN 1
                        1        1 OR OVER
BOMCOM             7    0        LESS THAN 1
                        1        1 OR OVER
UNIT PRICE         8    0        LT. OR EQ. TO 2241
                        1        2242 TO 3914
                        2        3915 TO 8410
                        3        8411 TO 19274
                        4        19275 OR OVER
VOLUME             9    0        LT. OR EQ. TO 275
                        1        276 TO 560
                        2        561 TO 1377
                        3        1378 TO 1734
                        4        1735 OR OVER
WEIGHT            10    0        LT. OR EQ. TO 850
                        1        851 TO 1500
                        2        1501 TO 3600
                        3        3601 TO 4900
                        4        4901 OR OVER
COMPONENT COUNT   11    0        LT. OR EQ. TO 88
                        1        89 TO 399
                        2        400 TO 911
                        3        912 TO 1186
                        4        1187 OR OVER
PERCENTDIGITAL    12    0        LT. OR EQ. TO 50
                        1        51 TO 440
                        2        441 TO 550
                        3        551 TO 870
                        4        871 OR OVER
PERCENTANALOG     13    0        LT. OR EQ. TO 240
                        1        241 TO 740
                        2        741 TO 750
                        3        751 TO 990
                        4        991 OR OVER
PERCENTEM         14    0        LT. OR EQ. TO 5
                        1        6 TO 20
                        2        21 TO 140
                        3        141 TO 760
                        4        761 OR OVER
PERCENTPS         15    0        LT. OR EQ. TO 5
                        1        6 TO 80
                        2        81 OR OVER
PERCENTXMTR       16    0        LT. OR EQ. TO 100
                        1        101 TO 190
                        2        191 TO 250
                        3        251 OR OVER
PERCENTSS         17    0        LT. OR EQ. TO 230
                        1        231 TO 860
                        2        861 TO 975
                        3        976 TO 995
                        4        996 OR OVER
POWERDIS          18    0        LT. OR EQ. TO 60
                        1        61 TO 150
                        2        151 TO 270
                        3        271 TO 500
                        4        501 OR OVER
BITFIT            19    0        LT. OR EQ. TO 5
                        1        6 TO 40
                        2        41 TO 130
                        3        131 OR OVER
IC                20    0        LT. OR EQ. TO 1
                        1        2 TO 5
                        2        6 TO 77
                        3        78 OR OVER
SRU               21    0        LT. OR EQ. TO 3
                        1        4 TO 9
                        2        10 TO 12
                        3        13 TO 16
                        4        17 OR OVER
QPA               22    0        LESS THAN 2
                        1        2 OR OVER


TABLE IX

Result of AID Runs

Parameters     Run 1    Run 2          Run 3    Run 4    Run 5
P1             .015     [illegible]    .0015    .001     .005
P2             .015     [illegible]    .0015    .001     .005
NMIN           4        4              4        4        3
MAXGP          30       [illegible]    30       30       30
R²             .617     .617           .683     .683     .694

Variables*
V              X        X              X        X        X
AN             X        X              X        X        X
W              X        X              X        X        X
CC             X        X              X        X        X
CARNAV         X        X              X        X        X
XMTR           X in four of the five runs
PD             X in three of the five runs
UP             (no X)

* Those which AID determined.
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This  value  was  lowered  to  3  in  the  following  3  runs.  Notice  that 
when  NM1N  was  increased  to  5  in  run  number  8,  R  decreased  from  .694 
in  run  number  8  to  .596.  So  indeed,  these  parameters  are  important 
in  modeling  decisions. 

The  two  best  runs  (based  on  highest  R2  values)  were  runs  5  and  7, 

where  number  7  contains  three  parameters  recommended  by  Gooch.  Run 

number  7  was  chosen  as  the  test  case  to  build  the  regression  model 
used  in  this  research  and  two  approaches  were  developed  from  this  run. 

The  AID  tree  and  results  for  run  number  7  are  described  in  Figure  4 

and  Table  X. 
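Each pair of rows in Table X is one binary split of a parent group. As a rough illustration, the sketch below searches for a split on a single recoded predictor. It assumes the usual description of AID: categories are ordered by their mean response, and the cut that maximizes the between-group sum of squares (BSS) is chosen, subject to the NMIN rule. The function name and toy data are hypothetical, and the P1, P2 and MAXGP stopping rules are not modeled.

```python
from collections import defaultdict


def best_binary_split(categories, y, nmin=3):
    """Return (BSS, left categories, right categories) for the best split.

    categories: recode category of each observation for one predictor.
    y: criterion value of each observation.
    nmin: minimum number of observations allowed in either child group.
    Returns None when no admissible split exists.
    """
    groups = defaultdict(list)
    for cat, value in zip(categories, y):
        groups[cat].append(value)
    # Assumption: the predictor is treated as nominal, so categories are
    # first ordered by their mean response.
    ordered = sorted(groups, key=lambda cat: sum(groups[cat]) / len(groups[cat]))
    n, total = len(y), sum(y)
    grand_mean = total / n
    best = None
    left_n, left_sum = 0, 0.0
    for cut in range(1, len(ordered)):
        members = groups[ordered[cut - 1]]
        left_n += len(members)
        left_sum += sum(members)
        right_n = n - left_n
        if left_n < nmin or right_n < nmin:
            continue  # NMIN stopping rule
        left_mean = left_sum / left_n
        right_mean = (total - left_sum) / right_n
        bss = (left_n * (left_mean - grand_mean) ** 2
               + right_n * (right_mean - grand_mean) ** 2)
        if best is None or bss > best[0]:
            best = (bss, sorted(ordered[:cut]), sorted(ordered[cut:]))
    return best


# Toy data: categories 0 and 2 have low responses, 1 and 3 have high ones.
cats = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
lsc = [1, 2, 3, 10, 11, 12, 2, 3, 4, 11, 12, 13]
split = best_binary_split(cats, lsc, nmin=3)
print(split)
```

Ordering by mean response is what lets non-adjacent categories land in the same group, as in several rows of Table X.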

Since  the  main  objective  of  using  AID  is  to  reduce  the  total 
number  of  variables  used  and  only  choose  those  which  are  most  important 
to  the  regression,  a  choice  can  be  made  as  to  where  to  stop  considering 
variables  for  analysis  purposes. 

If the analysis is stopped when N reaches 4, then three variables
remain: V, W, and AN. Considering interaction terms or cross product
terms, six variables can be used: V, W, AN, V·W, V·AN, and W·AN.

Another choice would be to stop considering variables for analysis
when N reaches 3. In this case, seven variables remain: V, W, CC, PD, AN,
XMTR and CARNAV. AN and XMTR can be considered partial indicators
in the sense that they can be represented as indicators (0 or 1), where
zero indicates that AN or XMTR equals zero and one indicates
that AN or XMTR is greater than zero. These indicator variables
are referred to in the analysis as IAN and IXMTR. CARNAV is a pure
indicator (either 0 or 1). In this case it was decided to use
interaction terms between the first six original variables and the

TABLE X


AID Tree Results for Run No. 7

VARIABLE                MEAN        STD. DEV.     N      R2

-          -            13687.90    15871.88             -
V          0 1 2         6708.97     7352.77             .294
V          3 4          24295.88    19133.53      25     .294
AN         1 3 4        11667.33     9052.37             .435
AN         0 2          31399.44    19640.68             .435
CC         1 3 4        25506.08     9193.78             .540
CC         0 2          49079.50    29540.98       4     .540
W          0 1 3 4       4040.72     4309.02      29     .595
W          2            15306.67     8460.32       9     .595
CARNAV     0            21670.25     9449.06       8     .617
CARNAV     1            33177.75     4745.71       4     .617
CC         2 4           6328.60     3030.33       5     .638
CC         1 3          18340.75     9629.98       4     .638
AN         3 4           6313.67     1385.32       3     .661
AN         0 1 2        19803.17     6763.90       6     .661
PD         4            17311.25     7900.37       4     .670
PD         2 3          26029.25     6508.13       4     .670
XMTR       0 1           2957.56     3052.07      25     .684
XMTR       2 3          10810.50     4820.07       4     .684
PD         1 4          16137.67     4832.67       3     .689
PD         2            23468.67     6417.00       3     .689
W          1 3            943.50      670.84      12     .694
W          0             4816.69     3208.98      13     .694

three indicator variables. A total of twenty-three variables are
created in this case. Both sets of variables are listed in
Table XI.
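The indicator and interaction terms described above can be sketched as follows. This is an illustration only: the record layout, the function name `build_terms`, and the choice to show just the V and W interactions are assumptions, and the snippet does not try to reproduce the full variable list of Table XI.

```python
def build_terms(row):
    """Add indicator and interaction terms to one observation.

    row: dict with raw values for V, W, AN, XMTR and CARNAV
    (a hypothetical record layout used only for this example).
    """
    terms = dict(row)
    # Partial indicators: 0 when the raw value is zero, 1 when it is positive.
    terms["IAN"] = 1 if row["AN"] > 0 else 0
    terms["IXMTR"] = 1 if row["XMTR"] > 0 else 0
    # Interaction terms: a continuous variable times an indicator.
    for base in ("V", "W"):
        for ind in ("CARNAV", "IAN", "IXMTR"):
            terms[f"{base}*{ind}"] = row[base] * terms[ind]
    return terms


example = build_terms({"V": 120.0, "W": 35.0, "AN": 0.0, "XMTR": 40.0, "CARNAV": 1})
print(example)
```

Multiplying by an indicator lets the slope of V or W differ between units that have, say, a transmitter and those that do not.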

In order to decide which model should be used, each set of
variables was run through the IMSL-RLEAP (Leaps and Bounds) program
described earlier. Using R2 as a criterion, the 23-variable model
explained 71.8 percent of the variance with 17 of the 23 variables,
while the 6-variable model explained only 50.1 percent of the variance
using all six variables. See Appendix D for a selected AID output
and Appendix F for a selected Leaps and Bounds output.
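For readers who want to see what selecting the best subset by R2 means in practice, here is a small sketch. It swaps in brute-force enumeration of every subset for the Leaps and Bounds branch-and-bound search, which reaches the same best subsets without fitting every one, so it is practical only for a handful of variables. It does not reflect the IMSL RLEAP interface; all names and the toy data are hypothetical.

```python
from itertools import combinations


def ols_sse(columns, y):
    """SSE of the least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    rows = [[1.0] + [col[r] for col in columns] for r in range(n)]
    p = len(rows[0])
    # Augmented normal equations [A'A | A'y].
    m = [[sum(row[i] * row[j] for row in rows) for j in range(p)]
         + [sum(row[i] * y[r] for r, row in enumerate(rows))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(p):
            if r != c:
                factor = m[r][c] / m[c][c]
                m[r] = [a - factor * b for a, b in zip(m[r], m[c])]
    beta = [m[i][p] / m[i][i] for i in range(p)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, row))) ** 2
               for r, row in enumerate(rows))


def best_subsets_by_r2(data, y):
    """Return {subset size: (variable names, R2)} for the best subset of each size."""
    mean_y = sum(y) / len(y)
    sst = sum((v - mean_y) ** 2 for v in y)
    best = {}
    for k in range(1, len(data) + 1):
        for names in combinations(sorted(data), k):
            r2 = 1.0 - ols_sse([data[name] for name in names], y) / sst
            if k not in best or r2 > best[k][1]:
                best[k] = (names, r2)
    return best


# Toy data in which y depends on "a" and "c" only.
data = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2, 1, 4, 3, 6, 5, 8, 7],
    "c": [1, 0, 1, 0, 0, 1, 1, 0],
}
y = [2 * a + 5 * c for a, c in zip(data["a"], data["c"])]
best = best_subsets_by_r2(data, y)
print(best[2])
```

Because R2 never decreases when a variable is added, it can only rank subsets of the same size, which is why a 17-variable and a 6-variable result are reported separately above.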

Next, a log transformation was applied to the 23-variable model,
which was then run through Leaps and Bounds. Surprisingly, the results
did not show an improvement over those of the untransformed data. Thus,
the 17 untransformed variables chosen by Leaps and Bounds were accepted
as those AID determined most important. This model will therefore
be used in the cross-validation experiments to follow. This
17-variable model is described in Table XII.


TABLE XI

Variables in the Two AID Models Considered

Model 1        Model 2

V              V
AN             W
W              CC
V·W            PD
V·AN           AN
W·AN           XMTR
               CARNAV
               V·CARNAV
               W·CARNAV
               CC·CARNAV
               PD·CARNAV
               XMTR·CARNAV
               V·IAN
               V·IXMTR
               W·IAN
               W·IXMTR
               CC·IAN
               CC·IXMTR
               PD·IAN
               PD·IXMTR
               AN·IXMTR
               XMTR·IAN
               IAN
               IXMTR

TABLE XII


AID Regression Model Determined by Leaps and Bounds

\[ \mathrm{LSC/OH} = B_0 + \sum_{i=1}^{17} B_i X_i \]

R2 = .718

 i     Bi                     Xi            Partial-F

 0     2.658290567            -             -
 1     -.155899 x 10^-2       V             11.7804
 2     -.779107 x 10^-1       W             22.7635
 3     .105464 x 10^-2        PD            5.52895
 4     .961796 x 10^-1        XMTR          8.75757
 5     .261128 x 10^1         CARNAV        7.58031
 6     .700891 x 10^-1        W·CARNAV      14.4932
 7     -.506175 x 10^-2       PD·CARNAV     18.0098
 8     -.267022 x 10^-1       AN·CARNAV     7.62132
 9     .878194 x 10^-3        V·IAN         9.5718
10     -.12007 x 10^-2        W·IAN         17.8296
11     .143445 x 10^-2        W·IXMTR       3.85888
12     .204243 x 10^-2        CC·IAN        12.2973
13     -.112446               CC·IXMTR      13.4675
14     .21432 x 10^-2         PD·IAN        15.1695
15     -.402166 x 10^-2       PD·IXMTR      22.0968
16     .153848                XMTR·IAN      8.40514
17     -.547619 x 10^1        IXMTR         7.77109

V  Cross Validation, Conclusions and Recommendations


Cross  Validation 

Three  equations  developed  by  Pulcher  and  one  developed  by 
Westinghouse  have  been  reviewed,  and  one  model  developed  by  AID 
has  been  analyzed.  All  have  been  based  on  the  old  Westinghouse 
data  collected  in  Phase  I  containing  63  data  points. 

A cross validation procedure was used to determine how well
these old models predict the 71 new data points contained in the
Phase II data.

The  first  step  was  to  use  the  new  data  in  each  of  the  old  models 
to  find  the  cross  validation  SSE  and  SST.  They  were  then  used  to  find 
the  cross  validation  R2  described  in  Chapter  II.  A  summary  of 
results  is  given  in  Table  XIII. 
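The arithmetic behind the cross validation R2 can be checked directly. The sketch below assumes the form R2 = 1 - SSE/SST, which reproduces the three values printed in Table XIII from the SSE and SST figures given there. The function and variable names are illustrative.

```python
def cross_validation_r2(cv_sse, sst):
    """Cross validation R2, assumed here to be 1 - (c.v. SSE) / SST."""
    return 1.0 - cv_sse / sst


# c.v. SSE and SST values copied from Table XIII (log-transformed models).
SST = 227.701363
cv_sse = {
    "L & B - R2": 157.9096802275,
    "L & B - Cp": 112.4418454273,
    "SPSS": 89.457779054,
}
cv_r2 = {model: cross_validation_r2(sse, SST) for model, sse in cv_sse.items()}
for model, value in cv_r2.items():
    print(f"{model}: {value:.4f}")
```

When the c.v. SSE exceeds SST, this quantity is negative, which is the situation flagged with an asterisk for the Westinghouse and AID models.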

In both the Westinghouse model and the AID model, the cross
validation SSE was greater than the SST. This implies
that neither of the two models predicts the new data very well. This
is a surprising result, especially for the Westinghouse model.

One possible explanation is that the Westinghouse model
was developed in such a way that many of the idiosyncrasies of the
data were explained. Notice the vast difference between the first
model described in Table IV and the second described in Table V. This
could also be the reason why the AID model failed to predict the
new data.

TABLE XIII


Cross Validation Results

Model            c.v. SSE           SST             c.v. R2

L & B - R2       157.9096802275     227.701363      .3065053361
L & B - Cp       112.4418454273     227.701363      .50618720
SPSS             89.457779054       227.701363      .6071269467
Westinghouse     *                  227.701363      *
AID              *                  1755.2523798    *

* c.v. SSE was greater than SST

The best model determined by the cross validation criterion was
Pulcher's SPSS model, which had a c.v. R2 value of .607 (see Table XIII).
Using those variables, updated coefficients have been computed
(see Table XIV). This new model, using the old variables and just
the Phase II data, has an R2 value of .780, indicating that 78% of
the variance in the dependent variable is explained by the model.

With the complete set of data (134 data points), 70.9% of the variance
was explained by the model. Table XV describes this model. (See
Appendix E for selected outputs from SPSS.)
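The summary statistics printed with Table XIV (.67923, .78004 and F = 7.73752) can be cross-checked against one another. The sketch below assumes n = 71 Phase II points and p = 22 fitted terms (Table XIV lists 23 terms, one of which SPSS removed). Under those assumptions the smaller value is consistent with an adjusted R2. The function names are illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 for a model with p regressors plus an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def overall_f(r2, n, p):
    """Overall F statistic for the regression, computed from R2."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


n, p, r2 = 71, 22, 0.78004  # assumed counts; R2 as printed in Table XIV
adj = adjusted_r2(r2, n, p)
f_stat = overall_f(r2, n, p)
print(round(adj, 5), round(f_stat, 3))
```

The small difference between the computed F and the printed 7.73752 comes from using R2 rounded to five decimals.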

Conclusions 

A review of past research indicates that much literature is
available on criteria for the selection of variables in multiple
regression, indicating that the subject is important not only
to mathematicians and operations researchers but to
anyone attempting to develop valid models for both description and
prediction purposes. These criteria give the statistician
a useful index of how well various models fit the data. However,
experience shows that the result of using a single criterion should
not be accepted as a final answer, but should be used with other
available statistics and the individual's intuitive judgement in
developing a sound analysis.

The cross validation R2 value was useful in evaluating the
prediction capabilities of the five models discussed. The three
models which used log transformed data and were developed by
Pulcher for description purposes on the old Westinghouse data

TABLE XIV


Pulcher's SPSS Model Fitted to the New Data Points

\[ \ln(\mathrm{LSC/OH}) = \alpha_0 + \sum_i \alpha_i D_i + \sum_j \beta_{j0} \ln X_j + \sum_j \sum_i \beta_{ji} D_i \ln X_j \]

Adjusted R2 = .67923
R2 = .78004
F = 7.73752

Variable Name     Coefficient     Partial-F

UP                .36000615       .17946696
W                 .60315963       .43885436
SS                1102.0[?]08     280.30031
NB                8.2056618       17.346404
SF                -.33287310      .39887747
SB                .99001459       .83450842
DIG               -1.3736140      2.5219780
EM                2.6000225       1.6961873
PS                .12680075       .31751393
NF * UP           .13008302       .15751183
NF * CC           -.13099197      .22804494
NB * UP           -.34773058      1.1390687
NB * V**          -               -
NB * W            -.75005236      2.6271934
DIG * V           -.14280299      .60250900
DIG * W           .49276704       .61187387
AN * UP           .06467638       .13817898
AN * W            -.13646127      .43869986
EM * V            -.35382193      .25552978
XMTR * CC         -.36370529      .46191312
XMTR * SS         531.35247       667.96390
BF * W            .0559571        .2229430
BF * SS           59.533507       207.10304
Constant          -10.43849       1.4928598

** Removed from the equation by SPSS

TABLE XV

Pulcher's SPSS Model Variables Fitted to the Entire Data Set
(Phase I & Phase II)

\[ \ln(\mathrm{LSC/OH}) = \alpha_0 + \sum_i \alpha_i D_i + \sum_j \beta_{j0} \ln X_j + \sum_j \sum_i \beta_{ji} D_i \ln X_j \]

had adequate predictive capabilities; the other two models (the
Westinghouse model and the AID model) were determined not to have very
good predictive capabilities.

The Automatic Interaction Detection algorithm was useful in
prescreening important variables and reducing the total number of
variables to be used in a multiple regression. However, it did not
prove to be the best technique for developing regression models, since
the maximum R2 value was only .780.

Recommendations 

In his research, Pulcher used the Chow Test as a screening device
to determine the most important variable subset using a Product of
Powers model. However, one assumption in using the test is that of
equal variances of the error term. In future analysis, I would
recommend the use of a technique developed by Jayatissa [21] for testing
equality between subsets of coefficients in two multiple regressions
assuming unequal variances. This can be used as a prescreening device
to locate important variables. Then stepwise regression procedures
using SPSS can be used to develop a multiple regression model.
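For reference, the standard form of the Chow test compares the SSE of one pooled regression with the combined SSE of separate regressions on two groups. The sketch below uses a single predictor (k = 2 parameters per regression) to keep the arithmetic visible. It illustrates the general test, not Pulcher's screening procedure, and the function names and toy data are hypothetical.

```python
def simple_sse(x, y):
    """SSE from the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


def chow_f(x1, y1, x2, y2, k=2):
    """Chow F statistic with (k, n1 + n2 - 2k) degrees of freedom.

    Assumes equal error variances in the two groups, which is the
    assumption questioned in the text.
    """
    sse_pooled = simple_sse(x1 + x2, y1 + y2)
    sse_separate = simple_sse(x1, y1) + simple_sse(x2, y2)
    df = len(x1) + len(x2) - 2 * k
    return ((sse_pooled - sse_separate) / k) / (sse_separate / df)


# Two toy groups with clearly different intercepts and slopes.
x1, y1 = [1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.9, 5.1]
x2, y2 = [1, 2, 3, 4, 5], [5.0, 7.1, 8.9, 11.2, 12.8]
f_different = chow_f(x1, y1, x2, y2)
f_same = chow_f(x1, y1, x1, y1)
print(f_different, f_same)
```

A large F suggests the two groups do not share one set of coefficients. When the groups are identical the statistic is zero.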

To the personnel at the Avionics Laboratory, I would recommend
that cross validation studies be made to ensure that models developed
by contractors are able to predict new data, so that new models do not
have to be developed every time new data are obtained.

All techniques used in this analysis were based on minimizing
the sum of squared errors. The many criteria for selection of
variables mentioned in this report should be given further consideration.

APPENDIX A
LRU Description

No.    LRU-ID   AIRCRAFT   DESCRIPTION

1      71B20               Amplifier, Computer
2      73530    F4E        Ballistics, Computer
3      71LB0    F4E        Receiver-Transmitter
4      71HK0    F4E        Platform, Gyro, Stab.
5      71PK0    RF4C       Receiver-Transmitter
6      71PB0    RF4C       Amplifier, P.S. RCVR
7      71710    RF4C       P.S. Leveling, Amplifier
8      724G0    RF4C       Power Supply
9      71G50    RF4C       Computer, Navigation
10     71FA0    F15A       Amplifier, Electronic
11     71FB0    F15A       Gyroscope, Displacement
12     71CA0    KC135A     Receiver-Transmitter
13     71DA0    F15A       Receiver-Transmitter
14     71ABE    B52H       Receiver
15     71ADA    B52H       Receiver-Transmitter
16     73DBA    B52H       Receiver-Transmitter
17     71ACC    B52H       Receiver
18     73CB0    B52H       Amplifier
19     73CEN    B52H       Computer, A2 and EL
20     73CFK    B52H       Receiver-Transmitter
21     73DAH    B52H       Amplifier, Electronic Control
22     73EBA    B52H       Amp, Astrotrack, Servo
23     73EBF    B52H       Signal Amplifier
*24    71CA0    F15A       Receiver
25     72EAA    KC135A     Receiver-Transmitter
26     72ECA    KC135A     Amplifier, Electronic Control
27     72BPO    C5A        Measurement Unit, IMU
28     7UA0     C5A        Receiver, VHF Navigational
29     71LA0    C5A        Receiver-Transmitter
30     72DN0    C5A        Processor Data
31     72AC0    C5A        P.S., Thermal Control
32     7171A    C130E      Receiver
33     7131D    C130E      Receiver-Transmitter
34     72RF0    C130E      P.S. Power Supply
35     72RB0    C130E      Amplifier
36     51EA0    F15A       Computer, Air Data
37     52AA0    F15A       Computer, Flight Control
38     52AB0    F15A       Computer, Flight Control
39     638D0    F15A       Control Panel, Int Nav
40     71AE0    F15A       Inertial Measurement Unit
41     71AK0    F15A       Control Indicator, Nav
42     74JA0    F15A       Indicator, Multiple Air Nav
43     74JC0    F15A       Processor, Signal Data
44     52GA1    F106       Amplifier-Interface
45     71JCE    C5A        Control Panel VHF Nav
46     72AE0    C5A        Computer-Primary, IDNE
47     72CC0    C5A        Computer-Analog/Digital
48     71ZA0    C130E      Receiver-Transmitter
49     71ZB0    C130E      Digital/Analog Converter
50     71ZD0    C130E      Control Unit
*51    71ZA0    F111D      Receiver-Transmitter
*52    71ZB0    F111D      Digital/Analog Converter
53     71ZC0    F111D      Control
54     73EG0    F111D      Computer, General Purpose
55     73EP0    F111D      Converter-Multiplexer
56     73HA0    F111D      Stabilizer Platform
57     73HC0    F111D      Navigational Computer
58     73NA0    F111D      Indicator, Horizontal Display
59     73NB0    F111D      Processor, Horizontal Display
60     73QB0    F111D      Electronic Unit, Radar
61     73SC0    F111D      Indicator, Digital Display
62     73KB0    F111D      Antenna-Receiver
63     73KE0    F111D      Amplifier, Power Supply
64     73KF0    F111D      Synchronizer-Transmitter
65     73DD0    F111D      Computer, Terrain Following
*66    71CA0    FB111A     Receiver Unit
67     73EG0    FB111A     Computer, General Purpose
68     73HC0    FB111A     Navigational Computer Unit
69     73LA0    FB111A     Electronic Unit
70     75930    F4E        Weapons Release Control
71     74BD0    F4E        Computer
72     74BF0    F4E        Transmitter
73     74810    F4E        Gyroscope, Lead Comp.
74     76A10    RF4C       Analyzer, Pulse
75     76GA0    RF4C       Signal Processor
76     74FF0    F15A       Processor
77     74FA0    F15A       Transmitter
78     74FB0    F15A       Power Supply
79     74FU0    F15A       Antenna
80     77EC0    B52H       Flir Signal Proc.
81     77EE0    B52H       Flir Turret Drive
82     77DCA    B52H       STV Camera, Electronic
83     77DB0    B52H       STV Turret Drive
84     73CR0    F4E        Laser Control, Electronic
85     73CG0    F4E        Two Axis Gimbal Assembly
86     65BH0    F15A       Processor, Radar Target Data
87     74FC0    F15A       Receiver, Radar
88     74FJ0    F15A       Oscillator-RF
89     74FK0    F15A       Radar Set Control
90     74FQ0    F15A       Processor, Radar Data
91     74KA0    F15A       Display Unit, Head Up
92     74KC0    F15A       Processor Signal Data
93     75AE0    F15A       Converter-Programmer
94     74CA0    F4E        Indicator, Control
95     74CB0    F4E        Indicator, Pilot
96     74CC0    F4E        Indicator, PSO, 10
97     74FA1    F106       -
98     74EB0    F15A       Lead Computing Gyro
99     76AEA    B52H       Transmitter
100    73KA0    FB111A     Computer, TFR
101    73PH0    F111D      Power Supply, LV
102    73PB0    F111D      Processor, Electronic
103    73PD0    F111D      Radar Transmitter
104    73PF0    F111D      Signal Data Converter
105    73PM0    F111D      Reference Signal Gen.
106    71NA0    F4E        Receiver-Transmitter
107    71QU0    RF4C       Receiver-Transmitter
108    63AA0    F15A       Receiver-Transmitter
109    65AA0    F15A       Receiver-Transmitter
110    63BAA    B52H       Receiver-Transmitter
111    63CAA    B52H       Receiver-Transmitter
112    65BAA    B52H       Receiver-Transmitter
113    61BBA    B52H       Receiver
114    65BAA    KC135A     Receiver-Transmitter
115    63AF0    KC135A     Receiver-Transmitter
116    63AA0    C5A        Receiver-Transmitter
117    63121    C130E      Receiver-Transmitter
118    63AAA    C130E      Receiver-Transmitter
119    55AL0    C5A        Central Multiplex Adapter
120    55AV0    C5A        Computer Digital, Madar
121    61AA0    C5A        Exciter Receiver, HF/SSB
122    61AC0    C5A        Amplifier/Antenna Coupler
123    61AE0    C5A        Panel, Control, HF/SSB
124    62AA0    C5A        Transceiver, VHF Comm
125    63A60    F15A       Radio Receiver
126    63BC0    F15A       Control Panel, Int Comm
127    63BF0    F15A       Control Panel, IFF
*128   61AA0    FB111A     Receiver-Transmitter
129    61AB0    FB111A     Amplifier-Power Supply
*130   61AC0    FB111A     Control
131    72AA0    FB111A     Control, Radar Transponder
132    72AC0    FB111A     Receiver Transmitter
133    64211               Intercom Set
134    64212               Control Panel

* DUPLICATE LRU-ID: Placed on a Different Aircraft

ITEMIZED INPUT FOR AID


* Extracted from McNichols [25]

Itemized Input for AID

1. Title Card

Card
Column(s)   Use               Description

1           Card Type         Must contain the numeric value "1".

2-49        Job Title         Up to 48 alphabetic and/or numeric
                              characters used to label the run.

50          IRUN              Numeric "0" for normal AID operation.

51-56       NCPERM            Number of cases in the data file. May
                              be omitted when data is from a disk or
                              tape file.

79-80       IFMT              The number of cards used for the FORTRAN
                              format statement (the next card or set of
                              cards in the control card deck). Up to 4
                              cards may be used.

2. FORTRAN Format Card(s)

Card
Column(s)   Use               Description

1-78        Data Format       FORTRAN format statement beginning with a
                              left parenthesis and ending with a right
                              parenthesis. Only integer fields of the
                              form Iw, where w is the number of characters
                              used to describe a variable, can be specified.
                              The following characters can be used: X to
                              skip columns, T to tab to a desired character
                              position, and / to indicate the beginning of
                              a new record for multiple record cases.
                              Warning: be careful not to extend the format
                              statement beyond column 78, as these
                              characters are not processed by AID. If more
                              than 78 characters are needed for the format
                              statement, use another format card and
                              change the count in column 80 of the title
                              card.


APPENDIX C

3. Description Card

Card
Column(s)   Use               Description

1           Card Type         Must contain the numeric value "3".

2-6         Stopping          Minimum value of TSS_i/TSS_T to consider
            Rule: P1          group i for splitting (Section 8.2.2,
                              paragraph 2). A decimal point is implied
                              to the left of column 2.

7-11        Stopping          Minimum value of BSS_i/TSS_T to permit
            Rule: P2          group i to split (Section 8.2.2, paragraph
                              3). A decimal point is implied to the
                              left of column 7.

12-16       Stopping          Maximum number of subgroups into which
            Rule: MAXGP       the set of data will be split.

17-21       Stopping          Minimum number of observations which must
            Rule: NMIN        be in a group after it is split. Value
                              must be at least 2.

22-26       Iteration         Number of AID iterations for which detailed
            Print: KSTOP      information will be printed. Only summary
                              results for iterations will be output after
                              this point.

27-29       No. of            Specifies number of variables to be read
            Variables: NV     from each case. This will be the total
                              number of variables described by the format
                              statement.

33          Rewind: KRW       Should be the numeric value "1" if input
                              data is on a disk or tape file, left
                              blank otherwise.

34          Missing           Set to "1" if a case with any out-of-range
            Values: IOPT      predictor values is to be rejected, blank or
                              zero otherwise. The "1" value is analogous
                              to listwise deletion in SPSS, as far as the
                              predictor variables are concerned. There is
                              no capability in AID which corresponds
                              directly to a pairwise deletion option. The
                              IOPT setting must be considered when
                              predictor cards (type 4) are coded.

37          Input             If zero or blank, the data file is assumed
            Medium: ICARD     to be a disk or tape file with the local
                              file name "TAPE25". If set to "1", data
                              is assumed to be on punched cards which
                              follow the AID control cards.

38          Tree              This parameter controls the output of
            Control: ITREE    computer printed tree diagrams summarizing
                              the splits. If set to zero or blank, no
                              diagrams are generated. If set to "1",
                              only a detailed tree is generated. If
                              set to "2", both a detailed and a skeleton
                              tree will be produced.

4. Predictor Card(s)

Card
Column(s)   Use               Description

1           Card Type         Must contain the numeric value "4".

                              There will be one predictor card for each
                              predictor variable to be used in the AID
                              run. However, not all predictors described
                              by the format statement need to be used
                              in the AID run. The NV parameter (card 3)
                              has a value associated with the number of
                              variables described by the format statement,
                              not the number of predictor cards used in
                              the run.

2-19        Predictor         Up to 18 alphabetic or numeric characters
            Name              used to label the predictors in the AID
                              output.

20-22       Field             A variable number which must correspond to
            Number            the variable sequence provided by the format
                              statement. That is, the third variable
                              described by the format statement represents
                              field number 3 for predictor variable
                              numbering purposes.

23          Predictor         Zero or blank for predictors to be treated
            Type: KBL1        as nominally scaled, "1" for variables to
                              be treated as ordinally scaled. The
                              example in section 8.1 illustrates the
                              nature of the treatment of nominal and
                              ordinal variables in AID.


Card
Column(s)   Use               Description

24          Predictor         This parameter, used in conjunction with
            Definition:       the IOPT value on card 3, tells AID how
            KBL2              to interpret the values on the remainder
                              of the predictor card. A zero value indicates
                              that the range of possible values for this
                              predictor variable will be divided into
                              intervals of fixed length. A value of
                              "1" means that the range of values for
                              this predictor will be divided into
                              intervals of varying length. When KBL2
                              is set to zero, minimum and maximum values
                              and an interval length will be provided.
                              When KBL2 is set to "1", boundaries for
                              the intervals into which the range of
                              predictor values will be divided will be
                              specified. Figure 8.7 summarizes the
                              interpretation of IOPT/KBL2 value combinations
                              and should be referenced in choosing the
                              desired values and predictor card format.

A. IOPT Equal Zero and KBL2 Equal Zero:

Card
Column(s)   Use               Description

25-30       Minimum           Predictor variable values less than or equal
            Predictor         to this value will be recoded to an internal
            Value: MIN        value (recode category) of 00.

31-36       Maximum           Predictor variable values greater than or equal
            Predictor         to this value will be recoded to the highest
            Value: MAX        recode category value used for this predictor.


Card
Column(s)   Use               Description

37-42       Interval          The length of the range of values for
            Length: INT       this predictor to be recoded into a
                              single recode category. The recode
                              category assigned to a specific predictor
                              value between the MIN and MAX value will
                              be:

                              \[ \text{Recode Category} = \frac{\text{Predictor Value} - \text{MIN}}{\text{INT}} \]

                              The number of recode categories will be:

                              \[ \text{NCAT} = \frac{\text{MAX} - \text{MIN}}{\text{INT}} + 1 \]

                              The highest numbered recode category will
                              be NCAT-1, and values greater than or equal
                              to MAX will be assigned this value.

44-45                         In a basic application of AID, each of
53-54                         these pairs of columns should contain the
62-63                         value "-1". These columns can be used in
                              conjunction with other predictor card
                              parameters to alter the recoding process by
                              assigning specific recode categories to
                              specific numeric values of the predictor
                              variable. Since this is a less often used
                              capability, it will not be discussed in detail
                              here.

Predictor Card Coding: Interpretation of IOPT/KBL2 Values

IOPT blank/zero: retain cases with out of range predictor values.

    KBL2 blank/zero (recode equal intervals). Predictor card provides:
        a) Minimum value of predictor: MIN
        b) Maximum value of predictor: MAX
        c) Interval length: INT
        Note: Number of intervals: (MAX-MIN)/INT + 1
              Values <= MIN recode to 00
              Values >= MAX recode to highest recode category

    KBL2 one (specified recode categories). Predictor card provides:
        a) Upper boundary for each predictor value range
        b) Recode values (00 to 39) for each predictor value range

IOPT one: reject cases with out of range predictor values.

    KBL2 blank/zero (recode equal intervals). Predictor card provides:
        a) Minimum value of predictor: MIN
        b) Maximum value of predictor: MAX
        c) Interval length: INT
        Note: Number of intervals: (MAX-MIN)/INT
              Values < MIN or > MAX lead to rejection of case

    KBL2 one (specified recode categories). Predictor card provides:
        a) Minimum value of predictor variable
        b) Upper boundary for each range of predictor values
        c) Recode values (00 to 39) for each range of predictor values
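The equal-interval recode for the case IOPT = 0, KBL2 = 0 can be sketched in a few lines. The function below follows the formulas given for MIN, MAX, INT and NCAT. Truncating the quotient to an integer is an assumption (the source does not state how fractional results are handled), and the function name is illustrative rather than part of AID.

```python
def equal_interval_recode(value, vmin, vmax, interval):
    """Recode a predictor value into an equal-interval category.

    Values at or below vmin go to category 0. Values at or above vmax go
    to the highest category, NCAT - 1. Integer truncation of the quotient
    is assumed.
    """
    ncat = (vmax - vmin) // interval + 1
    if value <= vmin:
        return 0
    if value >= vmax:
        return ncat - 1
    return (value - vmin) // interval


# With MIN = 0, MAX = 100 and INT = 25 there are NCAT = 5 categories (0 to 4).
codes = [equal_interval_recode(v, 0, 100, 25) for v in (-5, 0, 24, 25, 99, 100, 140)]
print(codes)
```

With IOPT = 1 the same interval arithmetic applies, but out-of-range values cause the case to be rejected instead of being folded into the end categories.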


B. IOPT Equal Zero and KBL2 Equal One:

Card
Column(s)   Use               Description

25-27       Lowest            A value between 00 and 39 which is the numeric
            Recode            value to be used internally by AID to represent
            Category          predictor variable values less-than-or-equal-
                              to the first specified input value.

28-33       First             A value of the predictor variable, used with
            Specified         the lowest recode category.
            Input Value

34-36       Second            A value between 00 and 39 which is the value to
            Recode            be used internally by AID to represent
            Category          predictor variable values strictly greater
                              than the first specified input value, and
                              less-than-or-equal-to the second specified
                              input value.

37-42       Second            A value of the predictor variable, used
            Specified         in conjunction with the second recode category
            Input Value       as the boundary of the predictor variable
                              values to be recoded to the value specified
                              by the second recode category.

43-45       Third             A value between 00 and 39 which is the
            Recode            value to be used internally by AID to
            Category          represent predictor variable values
                              strictly greater than the second specified
                              input value and less-than-or-equal-to the
                              third specified input value.

46-51       Third             A value of the predictor variable, used in
            Specified         conjunction with the third recode category.
            Input Value

52-54       Fourth &          The descriptions of these fields are comparable
61-63       Fifth Recode      to those given for the first, second and
            Categories        third recode categories and specified input
                              values.

C. IOPT Equal One and KBL2 Equal Zero:

Card
Column(s)   Use               Description

25-30       Minimum           Predictor variable values strictly less than
            Predictor         this value will cause the case to be rejected.
            Value: MIN

31-36       Maximum           Predictor variable values greater than or equal
            Predictor         to this value will cause the case to be rejected.
            Value: MAX

37-42       Interval          The length of the range of values for this
            Length: INT       variable to be recoded into a single recode
                              category. The recode category assigned to a
                              specific predictor value between MIN and
                              MAX will be:

                              \[ \text{Recode Category} = \frac{\text{Predictor Value} - \text{MIN}}{\text{INT}} \]

                              The number of recode categories will be:

                              \[ \text{NCAT} = \frac{\text{MAX} - \text{MIN}}{\text{INT}} \]

44-45                         In a basic application of AID, each of these
53-54                         pairs of columns should contain the value "-1".
62-63                         These columns can be used in conjunction with
                              other predictor card parameters to alter the
                              recoding process by assigning specific
                              recode categories to specific numeric values
                              of the predictor variable. Since this is a
                              less often used capability, it will not be
                              discussed in detail here.


D. IOPT Equal One and KBL2 Equal One:

Columns 25-27 (Recode Category): Used only when more than one predictor card is required to describe the predictor variable. On the first predictor card for a variable this field should be blank.

Columns 28-33 (First Specified Input Value): Predictor variable values less-than-or-equal-to this value will cause the case to be rejected.

Columns 34-36 (Second Recode Category): A value between 00 and 39 which is the value to be used internally by AID to represent predictor variable values strictly greater than the first specified input value and less-than-or-equal-to the second specified input value.

Columns 37-42 (Second Specified Input Value): A value of the predictor variable associated with the second recode category.

Columns 43-45 (Third Recode Category): A value between 00 and 39 which is the value to be used internally by AID to represent predictor variable values strictly greater than the second specified input value and less-than-or-equal-to the third specified input value.
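The pairing of recode categories with specified input values can be read as a lookup over ascending upper bounds. The Python below is a hedged sketch of that reading, not the AID implementation: the function name and the list-of-pairs argument are hypothetical, and returning None for a value above the last specified input value is an assumption, because the text does not describe that case.

```python
def threshold_recode(value, first_input_value, bounds):
    """Return the recode category for a predictor value, or None.

    bounds is a hypothetical list of (specified_input_value, recode_category)
    pairs in ascending order, starting with the second specified input value.
    """
    # Values less than or equal to the first specified input value reject the case.
    if value <= first_input_value:
        return None
    # A category covers values strictly greater than the previous bound
    # and less than or equal to its own specified input value.
    for upper_bound, category in bounds:
        if value <= upper_bound:
            return category
    # Assumption: values above the last bound are left unassigned.
    return None
```

For example, with a first specified input value of 0 and bounds `[(10, 1), (20, 2)]`, the value 10 maps to category 1 and the value 11 maps to category 2.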


Columns 46-51 (Third Specified Input Value): A value of the predictor variable associated with the third recode category.

Columns 52-54 and 61-63 (Fourth and Fifth Recode Categories): The descriptions of these fields are comparable to those given for the first, second, and third recode categories and specified input values.

Criterion Card

Column 1 (Card Type): Must contain the numeric value "5".

Columns 2-19 (Criterion Name): Up to 18 alphabetic or numeric characters used to label the criterion variable in the AID output.

Columns 20-22 (Field Number): A variable number which must correspond to the variable sequence provided by the format statement. That is, the third variable described by the format statement represents field number 3 for criterion variable identification purposes. The criterion variable does not have to be the field which is physically last in each case as long as the proper field numbers are used to identify predictors and the criterion.

Columns 23-24 (Weight Field): A variable number representing a weight field in each case, used to weight the values in AID computations. This field can be left blank, causing all cases to be equally weighted, and this is the normal mode of operation.

Columns 25-30 (Maximum Criterion Value: YMAX): If the criterion variable value is strictly greater than YMAX in a case, the case is rejected. Values up to "999999" can be specified for YMAX.
Columns 31-36 (Minimum Criterion Value: YMIN): If the criterion variable value is strictly less than YMIN in a case, the case is rejected.

Columns 37-42 and 43-48 (Deletion Values: MD1, MD2): If the criterion variable value is equal to either of these values, the case is rejected. If the use of deletion values is not desired, or only one deletion value is desired, setting MD1 and/or MD2 to values outside the range of YMIN to YMAX deactivates their use.
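Taken together, the YMAX, YMIN, and deletion-value rules amount to a simple acceptance test on each case. A minimal Python sketch follows; the function name is hypothetical and the code is illustrative rather than taken from AID.

```python
def criterion_case_accepted(y, ymin, ymax, md1, md2):
    """Return True if a case with criterion value y passes the criterion card checks."""
    # Rejected when strictly greater than YMAX or strictly less than YMIN.
    if y > ymax or y < ymin:
        return False
    # Rejected when equal to either deletion value.
    if y == md1 or y == md2:
        return False
    return True
```

The sketch also shows why a deletion value set outside the range YMIN to YMAX is inactive: any case whose criterion value equals it has already been rejected by the range check.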

AID End-Of-Job Card

Column 1 (Card Type): Must contain the numeric value "9". Indicates the end of all of the AID control cards.
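The column layout of the criterion card and the card-type codes can be illustrated by slicing a card image at the listed positions. The Python below is a sketch under stated assumptions, not a reproduction of AID's input routine: the names are hypothetical, the card is assumed to be an 80-column string, the numeric fields are assumed to parse as plain numbers, and a blank weight field is represented as None.

```python
# 1-based, inclusive column ranges as itemized for the criterion card.
CRITERION_FIELDS = {
    "card_type": (1, 1),
    "name": (2, 19),
    "field_number": (20, 22),
    "weight_field": (23, 24),
    "ymax": (25, 30),
    "ymin": (31, 36),
    "md1": (37, 42),
    "md2": (43, 48),
}


def is_end_of_job(card):
    """Card type "9" in column 1 marks the end of the AID control cards."""
    return card[:1] == "9"


def parse_criterion_card(card):
    """Split a criterion card image (card type "5") into its fields."""
    card = card.ljust(80)  # assumption: pad short lines to a full card image
    raw = {key: card[start - 1:end].strip()
           for key, (start, end) in CRITERION_FIELDS.items()}
    if raw["card_type"] != "5":
        raise ValueError("not a criterion card")
    return {
        "name": raw["name"],
        "field_number": int(raw["field_number"]),
        # A blank weight field means all cases are equally weighted.
        "weight_field": int(raw["weight_field"]) if raw["weight_field"] else None,
        "ymax": float(raw["ymax"]),
        "ymin": float(raw["ymin"]),
        "md1": float(raw["md1"]),
        "md2": float(raw["md2"]),
    }
```

Keeping the column ranges in one table makes it easy to check the sketch against the itemized layout.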


APPENDIX  D 


SELECTED  AID  OUTPUT 